Exploring data is one of the most important steps in data analytics. It is also the step that can take the most time: if a project takes ten months, six of them will go to cleaning and exploring the database (Forbes, 2016). Moreover, it is a phase that many avoid or try to minimize. At least that is what I perceive: analysts, managers, project managers, and people in similar positions want to reach the models, run them, and get results quickly. However, would you marry someone without knowing them first?
Probably not. If you are going to base strategic decisions that are crucial to your organization’s success on the results of your models (for example, whether or not to launch a product, whether to buy more or fewer inputs according to your forecasts, or whether to hire someone), then you should spend some time getting to know the data before “marrying” it through a model. This is even more important if you are going to use sophisticated methods and models: exploration with descriptive statistics gives you confidence that the selected models are capturing the essence of your data.
I love this phase because when we explore the data we find patterns, trends, outliers, and anomalies. This is a fundamental step for cleaning the data, and it gives us ideas about how we can use it.
Because of its importance, we will dedicate several posts to this topic. Today, in particular, we will discuss the types of data. We will not talk about exploration itself yet, because we first need to identify the kind of data we have and only then look for the right tool to explore it. Think of the data as a lock: not just any key can open it and unleash the information behind it. First, we have to find out what kind of lock it is.
Types of data
The fundamental division is between numerical and categorical data. Numerical data are stored in a “natural” way as numbers and can be “measured”, whereas categorical data are classes or categories and cannot be “measured.”
Numerical data are divided into continuous and discrete data. As a rule of thumb, if you see a decimal, the data are continuous, whereas if you see a whole number, they are discrete. I would add that things can get a little more complicated, because we must always take the context of the data into account. For example, suppose that you make and sell clothes for children under 12, and the sales department informs you that, according to the forecast (using an ARIMA model), 200.5 children’s shirts are expected to be sold next month. The variable “number of shirts” is discrete by nature, even though the forecast has a decimal.
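To make the rule of thumb concrete, here is a minimal Python sketch of a first guess at a column's type. The function name `classify_values` and the sample lists are my own illustrations, not part of any library, and the rule is deliberately naive: it looks only at the raw values, so it cannot know the context.

```python
def classify_values(values):
    """Naive first guess at the type of a column from its raw values."""
    # Anything that is not a plain number is treated as categorical.
    # bool is excluded because Python counts True/False as integers.
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not is_number:
        return "categorical"
    # Whole numbers only: discrete. At least one decimal: continuous.
    if all(float(v).is_integer() for v in values):
        return "discrete"
    return "continuous"


# Hypothetical sample data for illustration
hair_color = ["blond", "brown", "black"]
shirts_sold = [180, 195, 210]
shirt_forecast = [200.5, 198.2, 205.7]

print(classify_values(hair_color))      # categorical
print(classify_values(shirts_sold))     # discrete
print(classify_values(shirt_forecast))  # continuous
```

Notice the last line: the rule labels the forecast as continuous, although the number of shirts is discrete by nature. A check like this is only a starting point, and the context of the data has the final word.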
Data Levels of Measurement
Qualitative variables can be “measured” nominally or ordinally. Nominal means that we can classify the data into groups that do not follow a logical order. For example, hair color is a nominal variable whose categories can be blond, brown, black, white, red, and other. What does it mean that they do not follow a logical order? It means that one cannot rank the categories on a scale and say that blond hair is better than brown, which is better than black, which in turn is better than red, with white hair coming last.
Also, the categories have to meet two conditions: they must be mutually exclusive and collectively exhaustive. Mutually exclusive means that each individual should appear in only one group. For example, a woman cannot have black hair and blond hair simultaneously: it is either black or blond, but not both. I know that there are women who have black hair and dye the tips blond or add streaks, so what should you do? Everything depends on the objective of your research, that is, what you seek to answer with the data you collect. Once you are clear about that, you can decide whether to change your question (what is your natural hair color?), expand the response categories of the original question, or even leave the question and response categories unchanged.
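Since nominal categories have no order, the summaries that make sense for them are counts and the most frequent category. As a small sketch in Python (the list of answers is invented for illustration):

```python
from collections import Counter

# Hypothetical survey answers for illustration
hair = ["brown", "black", "brown", "blond", "red", "brown", "black"]

# Counting how many individuals fall in each category is meaningful
counts = Counter(hair)

# So is the mode, the most frequent category
mode, mode_count = counts.most_common(1)[0]

print(counts["brown"], counts["black"])  # 3 2
print(mode, mode_count)                  # brown 3
```

Python will happily sort these strings alphabetically, but that order says nothing about the hair colors themselves, so a ranking or a median would have no meaning here.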
Now, collectively exhaustive means that at least one of the events, or categories, must occur. That is why you so often see the option “other”, followed by the question “which one?”. Sometimes, even if we pilot our surveys, we cannot be sure that we are including absolutely all possible cases, and the category “other” is what saves us. For example, suppose you ask a man who is 100% bald, “What color is your hair?”, and the possible answers are blond, brown, black, white, and red. The “other” category (with the option of writing which one) is your salvation.
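If you already have the answers in hand, both conditions can be checked with a few lines of Python. This is only a sketch under my own assumptions: each response is stored as the set of categories the respondent selected, and the helper `check_categories` is a hypothetical name, not a library function.

```python
def check_categories(responses, categories):
    """Return (overlaps, gaps) as lists of respondent positions.

    overlaps: respondents who fall in more than one offered category
              (the categories are not mutually exclusive for them).
    gaps:     respondents who fit none of the offered categories
              (the categories are not collectively exhaustive for them).
    """
    overlaps = [i for i, r in enumerate(responses) if len(r & categories) > 1]
    gaps = [i for i, r in enumerate(responses) if not (r & categories)]
    return overlaps, gaps


with_other = {"blond", "brown", "black", "white", "red", "other"}
without_other = with_other - {"other"}

# Hypothetical responses for illustration
responses = [
    {"brown"},           # fits exactly one category
    {"black", "blond"},  # black hair with blond tips
    {"other"},           # the bald respondent
]

print(check_categories(responses, with_other))     # ([1], [])
print(check_categories(responses, without_other))  # ([1], [2])
```

With “other” in the list, the only problem is the respondent with two colors. Without it, the bald respondent has nowhere to go, which is exactly the gap the “other” option is there to close.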
In the next post, we will talk about the measurement levels of quantitative variables: interval and ratio. Also, I will leave you an exercise to strengthen these concepts before introducing some exploration tools.
- Forbes (2016). Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Available at https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#57e6bbe76f63