One of the most important first steps in a data science project is to truly understand the dataset you’re working with. Without a proper data exploration process in place, it becomes much more challenging to identify critical issues or successfully carry out a deeper analysis of the dataset.

Exploratory Data Analysis (EDA) in Data Science is a step in the analysis process that uses several techniques to visualize, analyze, and find patterns in the data. John Tukey, who developed the EDA method, likened it to detective work because you have to dig for clues and evidence before making any assumptions about the outcome.

A complete and solid EDA can help identify issues in your data like missing or wrong values, typos, and anomalies (outliers). In addition, you will learn about the distribution of the data and the relationships between variables, and find variables that may not affect the desired outcome. In this article, we’ll explore the principal techniques of Exploratory Data Analysis, along with the tools and graphs that help you understand the data better so you can ultimately answer business questions and find insights that may surprise your stakeholders.

When you start to explore the dataset, the first thing you have to evaluate is the attributes of the data you’re working on. Understanding the type of each variable will help you in the process of choosing the proper technique for the attribute analysis.

**Quantitative Data:** These are numerical values and can be either discrete or continuous. Discrete means countable values, like the number of children in a family. Continuous, on the other hand, represents values that can take any number within a range, like a person’s weight.

**Qualitative Data:** These are categorical values, which divide into two subtypes: nominal and ordinal. Nominal data are categorical values with no order relation, like gender and color. Ordinal data means that an order exists within the categories, but the distances between them are unknown, like a person’s level of education. It’s also worth noting that discrete data can sometimes be transformed into ordinal data, for example, by grouping ages into ranges.
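In practice, this type check is often the first lines of code in an EDA session. Below is a minimal sketch with pandas on a small hypothetical DataFrame (all column names are invented for illustration); note how an ordered categorical captures the ordinal case:

```python
import pandas as pd

# Hypothetical dataset covering the variable types described above
df = pd.DataFrame({
    "children": [0, 2, 1, 3],                        # quantitative, discrete
    "weight_kg": [70.5, 82.1, 65.0, 90.3],           # quantitative, continuous
    "color": ["red", "blue", "red", "green"],        # qualitative, nominal
    "education": pd.Categorical(
        ["primary", "secondary", "degree", "secondary"],
        categories=["primary", "secondary", "degree"],
        ordered=True,                                # qualitative, ordinal
    ),
})

print(df.dtypes)  # one dtype per column: int64, float64, object, category

# Separate quantitative columns for the numeric techniques that follow
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['children', 'weight_kg']
```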

With this understanding of the basics, you can then perform Univariate Analysis.

Univariate Analysis is the simplest form of data analysis. 'Uni' means one: you analyze a single attribute to understand the position of the data (through central tendency measures) and its sparseness (through dispersion measures).

The central tendency measures use the following:

**Mode:** The most frequent value in the attribute. It's possible to have no mode (flat distribution), one mode (unimodal distribution), or more than one mode (multimodal distribution). This measure can be used with either qualitative or quantitative data, but it is most common with qualitative data.

**Median:** The central value of the data when sorted in ascending order. This measure applies to quantitative data and is very resistant to outliers.

**Mean:** The best-known and most widely used measure. It is the sum of all values divided by the number of values. This measure applies to quantitative data, and it's very sensitive to outliers.

**Quantile:** Divides ordered data into groups of nearly equal size. The 50% quantile is the median, and one of the most common quantiles is the quartile, which splits the data into four parts of 25% each. However, position measures alone are not sufficient to characterize the distribution of the data, which is why we also need to analyze dispersion measures.
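The four central tendency measures above can be computed with Python's standard library alone. A small sketch on a hypothetical list of ages:

```python
import statistics

ages = [22, 25, 25, 29, 34, 41, 41, 41, 58]  # hypothetical attribute, already sorted

print(statistics.mode(ages))            # 41 (the most frequent value)
print(statistics.median(ages))          # 34 (the middle of the ordered data)
print(statistics.mean(ages))            # the mean, sensitive to outliers like 58
print(statistics.quantiles(ages, n=4))  # [25.0, 34.0, 41.0] -> Q1, Q2 (median), Q3
```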

Before jumping to the dispersion measures, let's look at an example of why it’s worth doing. Imagine that we have a set of employee salary ranges with the following values:

**[3200, 3900, 3400, 3500]**

The mean of this dataset is 3500. But, you can achieve the same mean with a set of different values. For example:

**[500, 3500, 3000, 7000]**

Again, the mean is 3500, even though the data is concentrated in the first example and sparse in the second. This is exactly what dispersion measures reveal. Let's explore them further to understand what they are and how they work.
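This comparison is easy to reproduce with Python's `statistics` module, using the two salary lists from the example:

```python
import statistics

concentrated = [3200, 3900, 3400, 3500]
sparse = [500, 3500, 3000, 7000]

# Identical central tendency...
print(statistics.mean(concentrated))  # 3500
print(statistics.mean(sparse))        # 3500

# ...but very different dispersion (population standard deviation)
print(round(statistics.pstdev(concentrated)))  # 255
print(round(statistics.pstdev(sparse)))        # 2318
```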

Dispersion measures show the variability of the dataset, and when analyzed together with the position measures, they give a big picture of how the data is distributed. To understand the sparseness of the data, we will look at the following measures: variance, standard deviation, amplitude, interquartile range, and coefficient of variation.

**Variance:** This indicates how far the values spread from the expected value; it is the average of the squared deviations from the mean.

**Standard deviation:** This is the square root of the variance and expresses the degree of dispersion of the dataset. When the standard deviation is low, the values tend to be close to the expected value; when it's high, they are spread over a wider range. The standard deviation can be read on the same scale as the original observations, while the variance is on the squared scale.

**Amplitude (range):** This is the difference between the largest and smallest values in the data, and it's useful only for a rough idea of the range within which the observations fall. This measure is sensitive to outliers and doesn't provide much information because it doesn't use all the values.

**Interquartile range:** A more robust measure is the interquartile range (IQR), the difference between the third and first quartiles. It is much less sensitive to outliers because the extreme values are ignored. This measure describes the middle 50% of the observations in the dataset: the larger the value, the more spread out the data.

**Coefficient of variation:** The last dispersion measure to introduce is the coefficient of variation, also called the relative standard deviation. It is the ratio of the standard deviation to the mean and represents how much the observations vary relative to the mean. It's valuable when we want to compare two attributes on different scales because it expresses the variability of the data independently of the variable's order of magnitude.
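As a sketch, all of the dispersion measures above can be derived with the standard library, here on the sparse salary list from earlier. One caveat: `statistics.quantiles` uses the "exclusive" method by default, so the quartiles can differ slightly from other conventions (such as NumPy's default):

```python
import statistics

data = [500, 3500, 3000, 7000]  # the sparse salary example above

variance = statistics.pvariance(data)   # 5375000 (squared scale)
std_dev = statistics.pstdev(data)       # ~2318.4 (same scale as the data)
amplitude = max(data) - min(data)       # 6500, uses only the two extreme values

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # spread of the middle 50%

cv = std_dev / statistics.mean(data)    # ~0.66, scale-free
print(variance, round(std_dev, 1), amplitude, iqr, round(cv, 2))
```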

With the understanding of these measures, you can analyze the attributes individually and get some insights. Next, let's learn how to do an analysis using more than one variable.

After analyzing the attributes individually, the next step is to understand the relationships between them. With this analysis, you can get some relevant insights, verify the degree of correlation between variables, and bring valuable information into the project's subsequent phases. Let’s take a look at some of **the correlation coefficients**. A correlation coefficient quantifies the statistical relationship (causal or non-causal) between two variables. Values range from -1 to 1, and values near the extremes mean a strong correlation between the variables. When the value is positive, both variables move in the same direction; when it is negative, one increases as the other decreases.

Pearson Correlation is one of the most common measures and analyzes how the variables are linearly related. This measure is very sensitive to outliers and is used only for quantitative data (which is why it's essential to understand the type of each variable at the beginning of the analysis). Depending on the type of data, you will need to use different techniques. Suppose you analyze two quantitative variables from your dataset, and the result of the Pearson Correlation is around 0 (which means that there is no linear correlation). In that case, it doesn't mean that those variables are not correlated. It means they are not linearly correlated but can be non-linearly correlated.

Figure 1 shows some examples of how the data looks with some of the values from Pearson's Correlation.

Another measure is the Spearman Correlation, which is similar to the Pearson Correlation but considers the order of the data instead of the values. It's also used for quantitative data but can be applied to ordinal qualitative variables, since the technique uses ranks in its calculations and you can assign ranks to ordinal attributes. The Spearman Correlation assesses the monotonic relationship between two continuous or ordinal variables. It is not sensitive to asymmetries in the distribution or to the presence of outliers, since it considers the order rather than the values of the variables. There are other rank correlation coefficients besides Spearman, such as the Kendall rank correlation. Figure 2 compares the Spearman and Pearson Correlations on data that yields a perfect correlation for Spearman but not for Pearson: the relationship is monotonic, not linear.

The details of each technique are beyond this article's scope, but keep in mind that it's important to check the variables' types first and then choose the proper method. If you pick two quantitative variables, you can choose between the Pearson and Spearman Correlations. If you choose two qualitative variables and they are ordinal, you can use any rank correlation technique; if they are nominal, you can use Chi-Square, Cramér's V, or Goodman and Kruskal's lambda. If you are pairing one quantitative and one qualitative variable, you can use the point-biserial correlation or logistic regression. Of course, these recommendations are just high-level ideas: the choice will depend on the distribution of the data, the size of the dataset, and the objective of the analysis.
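To make the Pearson/Spearman contrast concrete, here is a minimal pure-Python sketch. The naive ranking helper assumes no tied values; real implementations such as `scipy.stats.spearmanr` handle ties properly:

```python
def pearson(x, y):
    """Linear correlation between two quantitative variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Rank correlation: Pearson applied to the ranks (assumes no ties)."""
    def ranks(values):
        position = {v: i for i, v in enumerate(sorted(values), start=1)}
        return [position[v] for v in values]
    return pearson(ranks(x), ranks(y))

# A perfectly monotonic but non-linear relationship: y = x**3
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print(round(pearson(x, y), 3))   # 0.943: strong, but not a perfect linear fit
print(round(spearman(x, y), 3))  # 1.0: perfect monotonic relationship
```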

Another way to analyze the data is with graphs and tables. It’s important to note that the variable's type will determine the best chart to use. The following are some of the most frequent graphs used for quantitative and qualitative attributes to give you an idea.

**Frequency Tables:** Frequency tables summarize the data information into absolute or relative frequencies of each value (category). You can use them for qualitative and discrete data.

**Bar Charts:** You can also represent a frequency table in graphical form with a bar chart. This kind of graph is usually used to show the individual value of each category and to compare categories, identifying the ones with the highest and lowest values.

**Pareto Plots:** Another useful chart for this type of variable is the **Pareto plot**, which helps identify the top issues that account for most of the problems based on the Pareto principle. This graph is based on a bar chart and line chart, where the latter represents the cumulative frequency (see figure 3).

**Pie Charts:** Another typical graph for this type of variable is the pie chart. However, according to *Storytelling with Data* author Cole Knaflic, you should avoid pie charts because they can be hard to read: when slices are similar in size, it's difficult to tell which one is bigger, and by how much. As an alternative, you can use horizontal bar charts with ordered data. That said, when the goal is to show the relationship of one part to the whole, a pie chart can be the better choice.

**Contingency Tables:** One way to visualize the relationship of two qualitative data is through the contingency table, which displays the frequency distribution of the given attributes in a matrix.
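A contingency table is essentially a frequency count over pairs of values. A minimal standard-library sketch, using invented attributes (`plan`, `churned`):

```python
from collections import Counter

# Hypothetical qualitative attributes for a handful of customers
plan = ["basic", "pro", "basic", "pro", "basic", "basic"]
churned = ["yes", "no", "no", "no", "yes", "yes"]

# Cross-tabulate the two attributes into a frequency matrix
table = Counter(zip(plan, churned))

print(table[("basic", "yes")])  # 3
print(table[("basic", "no")])   # 1
print(table[("pro", "no")])     # 2
```

In practice, `pandas.crosstab` builds the same matrix in a single call and returns it as a labeled DataFrame.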

**Histograms**: Histograms are graphical representations used to display the distribution of continuous data. The graph shows the position of the mean and median, the dispersion, the number of peaks, and more (see figure 4), and it’s useful for visualizing the univariate analysis we performed previously.

**Line Plots:** This chart is based on quantitative data and is often used to represent data over a period of time. For example, to track the number of people affected by Covid over time, the y-axis would represent the number of people affected, and the x-axis would represent the date.

**Box Plots:** Another powerful graph is the box plot, which can be challenging to understand the first time you see it. However, once you understand the nuances of the chart (see figure 6), you will easily read the median, interquartile range, dispersion of the data, asymmetry, and the discrepant values, also known as outliers. But before digging into the outliers concept, let's look at some ways to represent multivariate analysis.

**Scatterplots**: Scatterplots are one of the most used graphs to compare the correlation of two quantitative variables. The y-axis represents one variable and the x-axis another.

**Heat Maps:** The heat map is another widely used visualization, in which color represents the individual values of a matrix; usually, more intense colors represent higher values. There are many ways to use heat maps. For instance, you can apply the Pearson Correlation across all quantitative attributes of a dataset (see figure 7) to find which variables are highly correlated, and then confirm with a scatterplot.
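The matrix behind such a heat map is simply the pairwise correlation table. A small sketch with pandas on a hypothetical dataset (a plotting library such as seaborn can then color the result with its `heatmap` function):

```python
import pandas as pd

# Hypothetical quantitative attributes
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 60, 68, 80, 90],
    "shoe":   [36, 38, 41, 43, 45],
})

# Pairwise Pearson correlation matrix: the values a heat map would color
corr = df.corr(method="pearson")
print(corr.round(2))
```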

The last critical analysis to explore is the detection of outliers. Outliers are observations that, although within the range of possible values, fall far from most instances in the dataset. There are three kinds of outliers: global, contextual, and collective.

**Global:** Global outliers are data points that occur far outside most of the data. The simplest way to identify them is with a box plot: the individual dots that appear in the chart are considered anomalous data.

**Contextual:** Contextual outliers are data points that are not outliers individually (like global outliers) but become outliers when observed in a certain context. For example, let's say that babies born between 38 and 42 weeks are of normal size if they weigh between 5.5 and 8.8 pounds. If a baby is born at 37 weeks weighing 8.6 lbs, we have a contextual outlier: looking only at the weight, it's not an outlier, but when we add the context of gestational weeks, the baby deviates from the majority.

**Collective:** Collective outliers are sets of data points that deviate from the rest of the dataset when looked at collectively, but are not considered global or contextual outliers. For example, if we observe the births of male and female babies each month, and only male babies are born in a specific month, that's a collective outlier.

There are many ways to automate outlier detection, using either statistical methods (standard deviations, the interquartile range, the normal distribution) or machine learning methods, like clustering. Even though these techniques help automate the process, I encourage you to do a manual analysis before automating it, to guarantee that the method you plan to use will handle the outliers properly. Finding outliers plays an essential role in the process: as we saw previously, some measures are highly affected by them, and some models are very sensitive to outliers while others are not. So, depending on the model in use, it's necessary to treat the data accordingly.
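One common statistical approach mentioned above, based on the interquartile range, flags values beyond 1.5 × IQR from the quartiles (Tukey's fences). A minimal sketch on a hypothetical salary list:

```python
import statistics

salaries = [3200, 3900, 3400, 3500, 3600, 3300, 20000]  # 20000 looks suspicious

q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences

# Anything outside the fences is flagged; box plots draw these as dots
outliers = [s for s in salaries if s < lower or s > upper]
print(outliers)  # [20000]
```

This is the same rule most box-plot implementations use to decide which points to draw individually, so the manual and visual analyses agree.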

As we’ve discussed, there are many techniques and ways to do Exploratory Data Analysis to understand how the data is structured. The EDA is a manual process that helps you fully understand the data, its distribution, and the presence or absence of outliers. However, some packages automate the EDA for you. For those who know Python, the two most common libraries are pandas-profiling and sweetviz. These tools provide helpful information about the data and cover some of the topics discussed in this article. Nevertheless, I strongly recommend only using tools like these as an initial analysis to get an overview of the data before applying what is covered here manually.

To summarize, the Exploratory Data Analysis phase helps you gain more knowledge about the domain and prevent potential issues from occurring. Once the analysis is complete, you will find that some data may be fixed or ignored. The model will not be built using all variables, either because some variables are irrelevant or because one variable has a high correlation with another. Additionally, you will know in advance the distribution of your data and the existence or non-existence of outliers, which can significantly impact the model you choose. With that, you will understand the data's patterns and trends and gain valuable insights.

Also, it's essential to highlight all findings you noticed from the Exploratory Data Analysis, the concerns and ideas for the given project that you got from the analysis, and all the work you will need to do in the preprocessing pipeline to train your model with adequate data. For those who want to get deeper into this topic, here are some book suggestions:

- Exploratory Data Analysis. John W. Tukey, 1977.
- Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. Usman Ahmed & Suresh K. Mukhiya, 2020.
- Storytelling with Data: A Data Visualization Guide for Business Professionals. Cole N. Knaflic, 2015.

As a final note, keep in mind that this is a cyclic process that a data scientist should apply more than once in most cases. It's important to frequently do this analysis with the data, even if you've done it in the past, because the data may change in the future, and new patterns, trends, and distributions can arise.


The post A Complete Introduction to Exploratory Data Analysis appeared first on Scalable Path.

]]>One of the most important things you can do when approaching a data science project is really understand the dataset you’re working with as a first step. Without a proper data exploration process in place, it becomes much more challenging to identify critical issues or successfully carry out a deeper analysis of the dataset.

Exploratory Data Analysis (EDA) in Data Science is a step in the analysis process that uses several techniques to visualize, analyze, and find patterns in the data. John Turkey, who developed the EDA method, likened it to detective work because you have to dig for clues and evidence before making any assumptions about the outcome.

A complete and solid EDA can help identify issues in your data like missing or wrong values, typos, and anomalies (outliers). In addition, you will learn about the distribution of the data, the relationship between variables, and find variables that may not affect the desired outcome. In this article, we’ll explore the principle techniques of Exploratory Data Analysis, tools, and graphs that help to understand the data better so you can ultimately answer business questions and find insights that may surprise your stakeholders.

When you start to explore the dataset, the first thing you have to evaluate is the attributes of the data you’re working on. Understanding the type of each variable will help you in the process of choosing the proper technique for the attribute analysis.

**Quantitative Data:**These are numerical values and can either be discrete or continuous. Discrete means a finite number, like the number of children in a family, for example. Continuous, on the other hand, represents infinite numbers, like a person’s weight.**Qualitative Data:**These are categorical values, which divide into two subtypes: nominal and ordinal. Nominal data are categorical values with no order relation, like gender and color. Ordinal data means that an order exists within the categories, but the distances between them are unknown like a person’s level of education. It’s also worth noting that discrete data can sometimes be transformed into ordinal data, for example, grouping ages in ranges.

With this understanding of the basics, you can then perform Univariate Analysis.

Univariate Analysis is the simplest form of data analysis. 'Uni' refers to analyzing one individual attribute to understand the position of the data in the dataset by the central tendency measures and the sparseness of that data by the dispersion measures.

The central tendency measures use the following:

**Mode:**The most frequent value in that attribute. It's possible to have no mode (flattened distribution), one mode (unimodal distribution), or more than one mode (multimodal distribution). Also, this measure can be used either by qualitative or quantitative data, but qualitative data uses it more often.

**Median:**The central value of the ascending ordered data. This measure is used by quantitative data and is very resistant to outliers.

**Mean:**The most known and used measure. It represents the sum of all values divided by the total number of values. This type of measure is used by quantitative data, and it's very sensitive to outliers.

**Quantile:**Divides ordered data in nearly equal sizes. The quantile 50% represents the median, and one of the most common quantiles is called quartile, which splits the data into four parts of 25% each. However, the position measurements are not sufficient to characterize the distribution of data, and therefore we need to analyze dispersion measures.

Before jumping to the dispersion measures, let's look at an example of why it’s worth doing. Imagine that we have a set of employee salary ranges with the following values:

**[3200, 3900, 3400, 3500] **

The mean of this dataset is 3500. But, you can achieve the same mean with a set of different values. For example:

**[500, 3500, 3000, 7000] **

Again, the mean is 3500 even though the data is more concentrated in the first example and more sparse in the second example. This is what the dispersion measures will show us. Let's explore this further to understand what it is and how it works.

The dispersion measures show the variability of the data set, and when analyzed together with the position measures, it gives a big picture of how the data is distributed. To understand the sparseness of the data, we will look at the following measures: variance, standard deviation, amplitude, interquartile range, and coefficient of variation.

**Variance:**This indicates how far the values are from the expected value.

**Standard deviation:**This is the square root of the variance and expresses the degree of dispersion of the dataset. When you have a low standard deviation, the values tend to be close to the expected value, but those values are spread over a wider range when it's high. The standard deviation value can be read in the same scale of the original observation, while the variation is in the squared scale.

**The amplitude (range):**This is the difference between the largest and smallest value in the data, and it's useful only to give a rough idea of the range that the observations fall. This measure is sensitive to outliers and doesn't provide much information because it doesn't use all the values.

**Interquartile range:**A more robust measure is the interquartile range (IQR), which is the difference between the third quartile and the first quartile. It's also not sensitive to outliers because the extreme values are ignored. The outcome of this measure describes the middle 50% of observations of the dataset. The larger this value is, the more spread out the data.

**Coefficient of variation:**The last measure of dispersion to introduce is the coefficient of variation and is also called the relative standard deviation. This measure is the ratio of the standard deviation to the mean and represents how much the observations vary in relation to the mean. It's a valuable measure when we want to compare two attributes on a different scale because it expresses the variability of the data, excluding the influence of the order of the variable's magnitude.

With the understanding of these measures, you can analyze the attributes individually and get some insights. Next, let's learn how to do an analysis using more than one variable.

After analyzing the attributes individually, the next step is to understand the relationship between them. With this analysis, you can get some relevant insights, verify the degree of correlation of those variables, and bring valuable information for the project's subsequent phases. Let’s take a look at some of **the correlation coefficients**. The correlation coefficient is any statistical relationship (causal or non-causal) between two variables. The values can go from -1 to 1, and the value in the extreme means that we have a high correlation between both variables. When the value is positive, both variables go in the same direction, and when it is negative, one increases, and the other decreases.

Pearson Correlation is one of the most common measures and analyzes how the variables are linearly related. This measure is very sensitive to outliers and is used only for quantitative data (which is why it's essential to understand the type of each variable at the beginning of the analysis). Depending on the type of data, you will need to use different techniques. Suppose you analyze two quantitative variables from your dataset, and the result of the Pearson Correlation is around 0 (which means that there is no linear correlation). In that case, it doesn't mean that those variables are not correlated. It means they are not linearly correlated but can be non-linearly correlated.

Figure 1 shows some examples of how the data looks with some of the values from Pearson's Correlation.

Another measure is the Spearman Correlation, which is similar to the Pearson Correlation, but it considers the order of the data instead. It's also used for quantitative data but can be used for ordinal qualitative variables since this technique uses ranks in its calculations. So you can assign ranks to ordinal attributes. The Spearman Correlation assesses the monotonic relationship between two continuous or ordinal variables. It is not sensitive to asymmetries in the distribution or the presence of outliers since we consider the order and not the values of the variables. There are other rank correlation coefficients besides Spearman, such as Kendall rank correlation. As we can see in figure 2, there is a comparison between Spearman and Pearson Correlation that shows a perfect correlation for Spearman, but not for Pearson. The reason is that the data is monotonic, not linear.

The details of each technique go beyond this article's scope, but keep in mind that it's important to check the variable's type first and then choose the proper method for it. If you pick two quantitative variables, you can choose between Pearson Correlation or Spearman Correlation. If you choose two qualitative variables and they are ordinal, you can use any rank correlation technique. If they are nominal, you can use Chi-Square, Cramers V, or Goodman Kruskal's lambda. If you are using one quantitative and one qualitative variable, then Point-biserial correlation or the Logistic Regression. Of course, these recommendations are just high-level ideas. The choice will depend on the distributions of the data, the size of the dataset, and the objective of the analysis.

Another way to analyze the data is with graphs and tables. It’s important to note that the variable's type will determine the best chart to use. The following are some of the most frequent graphs used for quantitative and qualitative attributes to give you an idea.

**Frequency Tables:** Frequency tables summarize the data information into absolute or relative frequencies of each value (category). You can use them for qualitative and discrete data.

**Bar Charts:** You can also use bar charts to represent this frequency table in graphical form. This kind of graph is usually used to show the individual values of each column (feature/attributes) and make some comparison among all columns, identifying the ones with highest and lowest values.

**Pareto Plots:** Another useful chart for this type of variable is the **Pareto plot**, which helps identify the top issues that account for most of the problems based on the Pareto principle. This graph is based on a bar chart and line chart, where the latter represents the cumulative frequency (see figure 3).

**Pie Charts:** Another typical graph used for this type of variable is the Pie chart. However, according to *Storytelling with Data* author Cole Knaflic, you should avoid using pie charts because they can be hard to read and difficult to tell which slice is bigger (and by how much) when they are very similar in size. As an alternative to using pie charts, you can use horizontal bar charts with ordered data. However, to see the relationship of one part to the whole, the pie chart is a better choice.

**Contingency Tables:** One way to visualize the relationship of two qualitative data is through the contingency table, which displays the frequency distribution of the given attributes in a matrix.

**Histograms**: Histograms are graphical representations used to display the distribution of continuous data. The graph shows the position of the mean, median, dispersion, number of peaks, and more (see figure 4) and it’s useful to visualize the Univariate analysis we performed previously.

**Line Plots: **This chart is based on quantitative data and is often used to represent the data over a period of time. For example, if we want to understand the number of people affected by the Covid over time, the y-axis would represent the quantitative data about the number of people affected, and the x-axis would represent the date over a time period.

**Box Plots:** Another powerful graph is the box plot, and it's usually challenging to understand when it's the first time you see it. However, after understanding the nuances of that chart (see figure 6), you will easily observe the median, interquartile range, dispersion of the data, asymmetry, and the discrepant values, also known as outliers. But before understanding the outliers concept, let's see some ways to represent the multivariate analysis.

**Scatterplots**: Scatterplots are one of the most used graphs to compare the correlation of two quantitative variables. The y-axis represents one variable and the x-axis another.

**Heat Maps:** The heat map is another data visualization that is widely used, where a color represents the individual values of the matrix. Usually, intense colors represent higher values. There are many ways to use heat maps. For instance, you can apply the Pearson Correlation of all quantitative attributes from a dataset (see figure 7) to understand what variables have a high correlation. After, you can confirm with the scatterplot.

The last critical analysis to explore is the detection of outliers. Outliers are examples within the range of possible values that fall outside most instances in the dataset. There are three kinds of outliers: global, contextual, and collective.

**Global:**Global outliers are data points that occur far outside of most of the data. The simplest way to identify them is in the box plot - dots that appear in the chart are considered anomalous data.

**Contextual:**Data points that are considered contextual outliers are those that are not individually outliers (like global outliers) but when observed in certain contexts are. For example, let's say that babies born between 38 and 42 weeks are of normal size if they range from 5.5 to 8.8 pounds. If a baby is born at 37 weeks and is 8.6 Lbs, then we have a contextual outlier. Looking only at the size, it’s not an outlier, but when adding in the context of weeks, the baby deviates from the majority of babies.

**Collective:**Collective outliers are sets of data that deviate from the rest of the data set when looked at collectively, but are not considered global or contextual outliers. For example, if we observe the birth of male and female babies each month, and only male babies are born in a specific month, then it’s a collective outlier.

There are many ways to automate the detection of outliers, using either statistical methods such as standard deviations, the interquartile range, and the normal distribution, or machine learning methods like clustering. Even though these techniques help automate the process, I encourage you to do a manual analysis first in order to guarantee that the method you plan to use will remove the outliers properly. We looked at the different ways to find outliers in the data, which plays an essential role in the process. As we saw previously, some measures are highly affected by outliers, and some models are very sensitive to them while others are not. So depending on the model in use, it's necessary to treat the data accordingly.
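As a concrete example of one of those statistical methods, the interquartile-range rule (the same Tukey rule a box plot uses to draw points beyond the whiskers) can be sketched in a few lines of pure Python. Real projects would usually rely on pandas or scipy for the quantiles, but the logic is the same:

```python
def iqr_outliers(values, k=1.5):
    """Flag global outliers outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    xs = sorted(values)

    def quantile(q):
        # Simple linear-interpolation quantile over the sorted values.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        frac = pos - lo
        return xs[lo] if frac == 0 else xs[lo] + frac * (xs[lo + 1] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # [95]
```

The constant `k=1.5` is the conventional box-plot whisker multiplier; a larger value (e.g. 3) flags only the most extreme points.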

As we’ve discussed, there are many techniques and ways to do Exploratory Data Analysis to understand how the data is structured. The EDA is a manual process that helps you fully understand the data, its distribution, and the presence or absence of outliers. However, some packages automate the EDA for you. For those who know Python, the two most common libraries are pandas-profiling and sweetviz. These tools provide helpful information about the data and cover some of the topics discussed in this article. Nevertheless, I strongly recommend only using tools like these as an initial analysis to get an overview of the data before applying what is covered here manually.

To summarize, the Exploratory Data Analysis phase helps you gain more knowledge about the domain and prevent potential issues from occurring. Once the analysis is complete, you will find that some data may be fixed or ignored. The model will not be built using all variables, either because some variables are irrelevant or because one variable has a high correlation with another. Additionally, you will know in advance the distribution of your data and the existence or non-existence of outliers, which can significantly impact the model you choose. With that, you will understand the data's patterns and trends and gain valuable insights.

Also, it's essential to highlight all findings you noticed from the Exploratory Data Analysis, the concerns and ideas for the given project that you got from the analysis, and all the work you will need to do in the preprocessing pipeline to train your model with adequate data. For those who want to get deeper into this topic, here are some book suggestions:

- Exploratory Data Analysis. John W. Tukey, 1977.
- Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. Usman Ahmed & Suresh K. Mukhiya, 2020.
- Storytelling with Data: A Data Visualization Guide for Business Professionals. Cole N. Knaflic, 2015.

As a final note, keep in mind that this is a cyclic process that a data scientist should apply more than once in most cases. It's important to frequently do this analysis with the data, even if you've done it in the past, because the data may change in the future, and new patterns, trends, and distributions can arise.

**Are you looking for help with your next software project?**

You've come to the right place. Our technical recruitment team has been carefully handpicked. Contact us and we’ll have your team up and running in no time.

The post A Complete Introduction to Exploratory Data Analysis appeared first on Scalable Path.

As artificial intelligence, or AI, increasingly becomes a part of our everyday lives, the need to understand the systems behind this technology, as well as their failings, becomes equally important. It's simply not acceptable to write AI off as a foolproof black box that outputs sage advice. In reality, AI can be as flawed as its creators, leading to negative outcomes in the real world for real people. Because of this, understanding and mitigating bias in machine learning (ML) is a responsibility the industry must take seriously.

Bias is a complex topic that requires a deep, multidisciplinary discussion. In this article, I’ll share some real-world cases where ML bias has had negative impacts, before defining what bias really is, its causes, and ways to address it.

At a high-level, it’s important to remember that ML models, and computers in general, do not introduce bias by themselves. These machines are merely a reflection of what we as humans teach them. ML models use objective statistical techniques, and if they are somehow biased it’s because the underlying data is already biased in at least one of many ways. Understanding and addressing the causes of this are necessary to ensure an effective yet equitable use of the technology.

In the criminal justice system, there is a desire to predict if someone who has done something unlawful is likely to offend again. Taking it further, it would be valuable to be able to accurately classify these people on a scale of low to high risk going forward to assist in decision making. Predicting recidivism is an important challenge for society, but it is also inherently very difficult to do in an accurate way. This is especially so in the context of machine learning, given many unobservable causes and contributing factors that cannot be neatly fed into an ML model.

This problem has been the subject of many sociological and psychological research studies and is the focus of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software. COMPAS is used in various states in the US to predict whether a person is likely to offend again, and its predictions are taken into account by judges when deciding sentences. It also is a commonly referenced example of a biased ML model.

An 18-year-old African American girl was arrested in 2014 for the theft of a bicycle and charged with burglary amounting to $80. More recently, a 41-year-old Caucasian man was picked up for shoplifting tools worth $86. The man was a repeat offender who had previously been convicted of multiple thefts and armed robberies; the girl had committed only some minor offenses when she was younger. According to COMPAS, the girl was high risk and the man was low risk. Two years later, the girl had not been charged with any new crimes, while the man was serving 8 years in prison for another offense.

Current versions may differ, but while the model was being used by states in the US, it predicted twice as many false positives for African Americans as for Caucasians, meaning it was much more likely to label an African American as "high" risk relative to Caucasian people.

The bias was obvious when the results were viewed from a specific angle, but removing such biases from these models was, explicitly or implicitly, not deemed important. Many believe these models could have been "improved" to reduce the unfair incarceration of African Americans rather than exacerbate it. We will come back to the COMPAS system a couple of times in this article, given that it illustrates the real-world impact and complexities of bias in machine learning.

Examples of bias with more subtle implications can often be found in Natural Language Processing (NLP). Understanding language is very difficult for computers due to the involved nuance and context, and automatically translating between languages is even more of a challenge. Needless to say, it’s one of the hardest problems currently being tackled by AI.

Google Translate is a convenient tool that can be used to translate between languages of very different roots, and it works well enough most of the time. However, as with any ML tool, examples can be found with which the models perform poorly and exhibit bias. In the case of Google Translate, its current online version can be used to attempt to translate from English to Turkish, and back into English the sentence: “She is a programmer. He is a nurse.”

The translation produces “O bir programcı. O bir hemşire.” If you then translate that back from Turkish into English, what you get is “He’s a programmer. She is a nurse,” which of course is not what we started with, and exhibits the presence of bias in the model.

The bias is happening because Turkish uses a gender-neutral pronoun “o,” and in English, it gets translated into a gender-dependent pronoun, either “she” or “he.”

**How will the models decide which gender-dependent pronoun to use when there's not enough information in the text being translated?**

One potential approach would be to use independent information that will allow the translation to be correct most of the time, even if incorrect in some cases. For example, the models could reference the following results from the annual StackOverflow survey to decide what gender-dependent pronoun to use.

If that were the case, the survey data would suggest that a programmer is most likely male. If we looked at similar results for nurses, it would probably turn out that most nurses are female. Note that I'm not claiming that the Google Translate models use StackOverflow survey results; I'm just trying to show how these models can become biased due to their reliance on statistics. In reality, Google Translate is probably using information from countless texts collected in digital records over decades, and the bias is being learned from our common language patterns.

This is why I mentioned at the beginning of this section that NLP bias is more subtle. Models that process language, especially those trained on digital records from online sources (e.g. Twitter, Facebook, Wikipedia, etc.), will learn the patterns we exhibit when speaking and writing, and we would be naive to think that everything we post online is not being used to train NLP models.

**The way we use language today influences how ML will work for us in the future. This should make us stop and think about the implications of the words we use and how we use them, and how that will impact our future when these models are responsible for more critical decisions.**

An example of how quickly NLP models can become problematic when trained with unfiltered data is Microsoft's Tay Twitter bot, an online ML system created to interact with Twitter users in real time. Less than a day after being released, the bot went from publishing friendly tweets to very scary and downright offensive ones. The bot was quickly turned off to avoid further brand damage. This reminds me of the following quote on the importance of language:

"All of this work shows that it is important how you choose your words. To me, this is actually a vindication of political correctness and affirmative action and all these things. Now, I see how important it is." —Joanna Bryson, Computer Science Professor at the University of Bath

Another example where removing bias from ML models is particularly hard is in image analysis, commonly referred to as Computer Vision (CV). Just as with NLP, CV is a very complex area where many things can go wrong, and where correctly identifying objects in a consistent fashion is very difficult. CV models are artifacts trained to recognize objects in pictures or videos. Significant improvements have been achieved using Deep Learning (DL), and sometimes these models can even identify objects more accurately than humans. However, these models are imperfect and still make mistakes that sometimes are inoffensive, and other times can significantly offend people.

There is a well-documented instance of an offensive categorization happening when Google Photos incorrectly identified a picture of African Americans as “gorillas” in 2015. The tweet is still out there, and has led to a lot of discussion in the CV community:

A common source of these biases is known as under-representation, which means that there are examples for which not enough data is being used to train the models. For example, if you search for wedding pictures online, you’ll probably find pictures of women in long white dresses and men in black suits, and that comes from the fact that there are many more available examples of those types of weddings than of other types. If you invert the problem, you’ll find that if you use a CV model to identify what’s in these pictures, the accuracy of the results will depend on the tropes present in the sample data.

One of Google’s teams identified that problem and produced the example shown below.

Since then Google has taken a much more proactive and public stance in favor of producing ML that is socially conscious and respects people in ways that were not considered before. However, there’s still a long way to go on this front.

There are many definitions of bias, which is why this article started with real-world examples of bias in the sense being discussed here. Now that we have seen some examples, we need to define the term to enable a meaningful discussion. Here are some informal definitions from disciplines related to ML, at different levels of technicality and abstraction.

- “Bias” in data: when a sample is not representative of the underlying population it represents. When this happens, your statistical derivations become inaccurate.
- “Bias” in statistics: the difference between expected and actual values. When this happens, even if your estimations are consistent and precise, they are incorrect.
- "Bias" in sociology: tendency, known or unknown, to prefer one factor over another, preventing objectivity. When this happens, something is influencing decisions or observations in a way that is deemed undesirable.
- “Bias” in neural networks: a parameter for a neuron which is used to tune a model. This is used to make trade-offs between bias and variance (using their statistical definitions for ML).
- “Bias” as most people understand it: when past experiences or erroneous assumptions affect our perception in a way that negatively affects other people. When this happens we often say that something is unfair.
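The statistical definition above (a gap between expected and actual values) can be made concrete with a classic example: the sample variance that divides by n is a biased estimator, because on average it undershoots the true population variance by a factor of (n - 1)/n. A quick pure-Python simulation illustrates this; the printed values are approximate:

```python
import random

random.seed(0)

def var_biased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)        # divides by n

def var_unbiased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # divides by n - 1

# Draw many small samples from a population with known variance 1.0 and
# average each estimator: the biased one systematically undershoots.
n = 5
biased, unbiased = [], []
for _ in range(20000):
    sample = [random.gauss(0, 1) for _ in range(n)]
    biased.append(var_biased(sample))
    unbiased.append(var_unbiased(sample))

print(sum(biased) / len(biased))      # ~0.8, i.e. (n - 1) / n
print(sum(unbiased) / len(unbiased))  # ~1.0
```

Each individual estimate can still be far from 1.0; bias here is about the systematic direction of the error, not its size on any single sample.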

As is seen above, “bias” is a loaded term that means very different things in different contexts and disciplines. To be clear, in this article we’re referring to the last definition of the term: “Bias” as most people understand it. However, this definition is somewhat vague and requires further exploration.

For example, what does “erroneous assumptions” mean? What does “negatively affect other people” mean? Even though they imply very complex concepts and discussions, as humans we intuitively understand what those phrases mean, or can fill in the blanks when considering specific situations. However, we need to be more precise since ML requires specific and explicit rules, and that’s where things get complicated.

Let me start by stating the obvious: qualifying something as being “good” or “bad” is always relative to many complexities around the context of the qualification. For example, killing someone is considered “bad”, but killing an intruder in your house to protect your family is considered “good”. The problems being tackled by ML are much more nuanced than this, which is what makes the challenges of bias so difficult to handle.

Following that same idea, biases can be “good” or “bad” depending on the context, and sometimes there’s no definite answer that can be agreed on by two people, let alone entire societies. “Bias” is not inherently “bad,” even if the term is mostly used with negative connotations in general, and particularly within this article.

**If it’s hard to think of “biases” as being neutral (i.e. not necessarily “good” or “bad”), you can refer to them as “rules of thumb,” “shortcuts,” or “heuristics.”**

“Stealing is bad” works as a general rule (bias) that can be useful to quickly classify the likely morality of an action when no further detail is present, but that isn’t where the analysis of a scenario should end. Biases often help us make quick decisions that would otherwise be impractical to make or help us compensate for lack of information, but problems arise when we do not contextualize them appropriately or don’t update them as we receive more information.

In Thinking, Fast and Slow, Kahneman describes our brains as being composed of two systems:

- The first system is quick to judgment, much like the “shortcuts” we described before as “biases.”
- The second system is slower and weighs data more carefully before making a decision.

Kahneman believes that with training and experience, we can learn to disengage our first system and engage more proactively with the second one - equivalent to building an ML model that’s accurate through meaningful analysis, instead of it relying on bias as a broad shortcut to an answer that’s likely to be statistically (but not categorically) accurate.

With respect to ML, a bias may simply be a correlation, fair or unfair, that leads to a certain kind of classification. This process is inherently neutral until given context and evaluated in terms of fairness. Bias, in a negative sense, is a requirement for something to be “unfair” - but there is no standard definition for “fairness.” Societies differ on what is “fair.” Even people within societies differ on what they regard as fair. “Fairness” in ML represents both an opportunity and a challenge.

Identifying appropriate “fairness criteria” for a system requires accounting for human experiences and perceptions, as well as cultural, social, historical, political, legal, and ethical considerations. Claims about bias and fairness are often about outcomes differing between groups, and the question of which are the relevant groups to consider is fundamentally a practical and moral one. Furthermore, at what level of granularity should groups be defined, and how should the boundaries between these groups be decided? When is it fair to define a group at all versus a better factoring of individual differences?

**“You can’t change what you don’t measure.”**

Imagine we have an urn that we can see contains 90 black balls and 10 white balls, and we're told we will win a prize if we correctly guess the color of the next ball randomly drawn from the urn. What color would you pick? You would probably pick black, as it's the more likely outcome. Now, what if we go back to the COMPAS example and make similar assumptions? Let's assume that, historically, 90% of repeat offenders are African American and 10% are Caucasian, and the prize society wins by identifying repeat offenders is lower crime rates. Who would you pick as someone likely to be a repeat offender? What if the percentages are 60% and 40%, respectively? Well, of course it shouldn't be color-based! Or should it? Wheelan puts it nicely in his book "Naked Statistics":

“Is it ok to discriminate if the data tells us that we’ll be right far more often than wrong? It would be naive to think that gender, age, race, ethnicity, religion, and country of origin collectively tell us nothing about anything related to crime. But what we can or should do with that kind of information is a philosophical and legal question, not a statistical one. If we can build a model that identifies offenders correctly 80 out of 100 times, what happens to the poor souls in the 20 percent? Our model is going to harass them over and over and over again.” —Wheelan, Naked Statistics

Of course, real-world problems are never simple and require precise measurement to evaluate. At a minimum, for classification problems, it's important to pay attention to overall accuracy as well as the rates of true and false positives and negatives. To measure ML bias, you will want to track the performance of various accuracy metrics for different groups in your data at various levels of granularity, and cross-validate them across different sets of randomly selected features and observations. This is only a starting point for ensuring the accuracy and equity of an ML model; a holistic approach with meticulous analysis is required. There is much debate in the community over how exactly to achieve this, and the company responsible for COMPAS made counter-arguments defending its methodology on statistical grounds.
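The per-group tracking described above can be sketched directly: given labeled records, compute a metric such as the false positive rate separately for each group and compare. The toy numbers below are invented purely to mirror the COMPAS finding of a doubled false positive rate; they are not real data:

```python
def false_positive_rates(records):
    """records: iterable of (group, actual, predicted) booleans, where
    actual/predicted mean 're-offended' / 'flagged high risk'.
    Returns the false positive rate per group."""
    counts = {}  # group -> [false positives, actual negatives]
    for group, actual, predicted in records:
        fp_neg = counts.setdefault(group, [0, 0])
        if not actual:              # person did not re-offend...
            fp_neg[1] += 1
            if predicted:           # ...but the model flagged them anyway
                fp_neg[0] += 1
    return {g: fp / neg for g, (fp, neg) in counts.items() if neg}

# Hypothetical data: group "A" is wrongly flagged high-risk twice as
# often as group "B" among people who did not re-offend.
data = (
    [("A", False, True)] * 4 + [("A", False, False)] * 6 +   # FPR 0.4
    [("B", False, True)] * 2 + [("B", False, False)] * 8     # FPR 0.2
)
print(false_positive_rates(data))  # {'A': 0.4, 'B': 0.2}
```

Both groups could still show identical overall accuracy, which is why comparing rates per group, rather than in aggregate, is what surfaces this kind of disparity.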

This introduces an important question: how do we decide which measure of fairness is appropriate? It will depend on the expertise of the people involved in building these systems, but in general, answering the following two questions should help inform the decision of which accuracy metrics to use for a given ML problem:

- What aspects of our society do we wish to be ignored by ML models?
- What biases in our society do we wish to see corrected or changed?

Finally, keep in mind that metrics are often aggregated. However, in some ML problems even a single classification can be quite harmful, such as a false positive identifying an individual as a threat. Therefore, it's not enough to control for these metrics in aggregate; it's also important to identify case-by-case instances where incorrect biases can be harmful.

Bias in ML models has been recognized as a very important challenge to address, which has led to regulatory involvement. For example, in the banking and financial industry in the United States, the Equal Credit Opportunity Act for fair lending states that institutions cannot discriminate based on race, sex, age, national origin, or marital status, or on proxies of these concepts. In the United States, postal codes are highly correlated with race, and therefore cannot be used to train ML models that will be used to make decisions about whether to give credit to a person.

An international example can be seen in the AI Principles published by the Organisation for Economic Co-operation and Development (OECD), which were adopted by forty-two countries and focus not only on fairness but also on the privacy aspects of ML models, another very important topic not touched on in this article. One such principle directly refers to ML fairness and the necessity of keeping humans in the loop to ensure the ethical use of these models.

“AI systems should be designed in a way that respects the rule of law, human rights, democratic values and diversity, and they should include appropriate safeguards – for example, enabling human intervention where necessary – to ensure a fair and just society.”

The "Recommendation of the Council on Artificial Intelligence" is another document published by the OECD, which states that models should be transparent, explainable, accountable, and robust, all concepts we've touched on in this article with respect to ML bias and fairness. A third example, often discussed in the context of data-related company problems, is Europe's General Data Protection Regulation (GDPR), which states the following:

“In any case, such processing should be subject to suitable safeguards, including specific information of the data subject and the right to obtain human intervention, to express his or her point of view, to get an explanation of the decision reached after such assessment and the right to contest the decision. In order to ensure fair and transparent processing in respect of the data subject, having regard to the specific circumstances and context in which the personal data are processed, the controller should use adequate mathematical or statistical procedures for the profiling, implement technical and organisational measures appropriate to ensure in particular that factors which result in data inaccuracies are corrected and the risk of errors is minimized, secure personal data in a way which takes account of the potential risks involved for the interests and rights of the data subject and which prevents inter alia discriminatory effects against individuals on the basis of race or ethnic origin, political opinions, religion or beliefs, trade union membership, genetic or health status, sexual orientation or that result in measures having such effect.”

It’s a very positive sign that regulatory bodies are taking action when it comes to ensuring ML models are fair. It’s a socially relevant topic that would probably not be addressed if left to companies and ML professionals alone, since there’s no immediate commercial incentive for them to do so, especially when compared against the high cost of adequately dealing with ML bias.

Consider the case of the previously mentioned Tay Twitter bot. Microsoft argued that one of the reasons that Tay’s speech became so offensive was because of trolls intentionally sending it malicious text, which then served to train the bot. Failure by its developers to account for this led to the termination of the project, and likely many lessons learned for the ML community as a whole.

In scenarios where the training data or methodology (such as pitting two models against each other) involves input from other parties, considering the potential for malicious actors becomes necessary. This may seem like science fiction, but adversarial ML and its respective counter-strategies are becoming a reality in the field. As AI use becomes more prevalent, and as systems begin to interact with each other and with the public in general, there are sure to be conflicts and unexpected outcomes.

As we have seen, ML bias is a complex topic, both from a moral and technological standpoint. The negative impact of bias in real-world scenarios is clear, and as a result, the industry and regulators have been taking steps to minimize the amount a system can be viewed as “unfair”. Still, there is much work to be done as the field evolves.

We, as the worldwide ML community, must make sure we don’t leave these issues unaddressed as these models are the building blocks over which more autonomous systems will be built. As AIs become more omnipresent in our daily lives, the efficacy of their design will have real consequences on our way of living. It’s up to us and informed citizens using their voice to drive policy, to ensure a future where AI is a fair force for good in the world.


The post Understanding Bias in Machine Learning appeared first on Scalable Path.
