Data Preprocessing: 6 Techniques to Clean Data

Nicolas Azevedo
Senior Data Scientist

The data preprocessing phase is the most challenging and time-consuming part of data science, but it’s also one of the most important parts. If you fail to clean and prepare the data, it could compromise the model.

“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.”

– Andrew Ng

When dealing with real-world data, Data Scientists will always need to apply some preprocessing techniques in order to make the data more usable. These techniques will facilitate its use in machine learning (ML) algorithms, reduce the complexity to prevent overfitting, and result in a better model.

With that said, let’s look at what data preprocessing is, why it’s important, and the main techniques to use in this critical phase of data science.


What is Data Preprocessing?

After understanding the nuances of your dataset and the main issues in the data through the Exploratory Data Analysis, data preprocessing comes into play by preparing your dataset for use in the model. 

In an ideal world, your dataset would be perfect and without any problems. Unfortunately, real-world data will always present some issues that you’ll need to address. Consider, for instance, the data you have in your company. Can you think of any inconsistencies such as typos, missing data, different scales, etc.? These examples often happen in the real world and need to be adjusted in order to make the data more useful and understandable. 

This process, where we clean and solve most of the issues in the data, is what we call the data preprocessing step.

Why is Data Preprocessing Important?

If you skip the data preprocessing step, it will affect your work later on when applying this dataset to a machine learning model. Most of the models can’t handle missing values. Some of them are affected by outliers, high dimensionality and noisy data, and so by preprocessing the data, you’ll make the dataset more complete and accurate. This phase is critical to make necessary adjustments in the data before feeding the dataset into your machine learning model.

Important Data Preprocessing Techniques

Now that you know more about the data preprocessing phase and why it’s important, let’s look at the main techniques to apply in the data, making it more usable for our future work. The techniques that we’ll explore are: 

  • Data Cleaning
  • Dimensionality Reduction
  • Feature Engineering
  • Sampling Data 
  • Data Transformation
  • Imbalanced Data 

Data Cleaning

One of the most important aspects of the data preprocessing phase is detecting and fixing bad and inaccurate observations from your dataset in order to improve its quality. This technique refers to identifying incomplete, inaccurate, duplicated, irrelevant or null values in the data. After identifying these issues, you will need to either modify or delete them. The strategy that you adopt depends on the problem domain and the goal of your project. Let’s see some of the common issues we face when analyzing the data and how to handle them. 

Noisy Data

Usually, noisy data refers to meaningless data in your dataset, incorrect records, or duplicated observations. For example, imagine there is a column in your database for ‘age’ that has negative values. In this case, the observation doesn’t make sense, so you could delete it or set the value as null (we’ll cover how to treat this value in the “Missing Data” section). 

Another case is when you need to remove unwanted or irrelevant data. For example, say you need to predict whether a woman is pregnant or not. You don’t need information about her hair color, marital status, or height, as these are irrelevant to the model.

Depending on the context, an outlier can be considered noise, even though it might be a valid record. You’ll need to determine whether a given outlier is noise and whether you can delete it from your dataset.

Solution: 

A common technique for noisy data is the binning approach, where you first sort the values, then divide them into “bins” (buckets of the same size), and then apply a mean or median to each bin, smoothing the data. If you want to learn more, here is a good article on dealing with noisy data.
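As a minimal sketch of the binning idea, assuming pandas and a hypothetical series of ages (the column name and values are made up for illustration):

```python
import pandas as pd

# Hypothetical noisy 'age' values
ages = pd.Series([22, 25, 27, 31, 35, 38, 41, 45, 48])

# Sort the values, split them into 3 equal-frequency bins,
# then smooth by replacing each value with its bin's mean
sorted_ages = ages.sort_values()
bins = pd.qcut(sorted_ages, q=3, labels=False)
smoothed = sorted_ages.groupby(bins).transform("mean")
```

Each value in `smoothed` is the mean of its bin, so small fluctuations inside a bin are flattened out.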

Missing Data

Another common issue that we face in real-world data is the absence of data points. Most machine learning models can’t handle missing values in the data, so you need to intervene and adjust the data to be properly used inside the model. There are different approaches you can take to handle it (usually called imputation):

Solution 1

The simplest solution is to remove that observation. However, this is only recommended if:

1) You have a large dataset and a few missing records, so removing them won’t impact the distribution of your dataset. 

2) Most of the attributes of that observation are null, so the observation itself is meaningless. 

Solution 2: 

Another solution is to use a global constant to fill that gap, like “NA” or 0, but only if it’s difficult to predict the missing value. An alternative option is to use the mean or median of that attribute to fill the gap. 

Solution 3: 

Using the backward/forward fill method is another approach that can be applied, where you either take the previous or next value to fill the missing value. 
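A quick sketch of forward and backward fill with pandas, using a made-up series with gaps:

```python
import numpy as np
import pandas as pd

# A hypothetical series with missing values
s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan])

forward = s.ffill()   # propagate the previous valid value forward
backward = s.bfill()  # pull the next valid value backward
```

Note that forward fill leaves a gap if the series starts with a missing value, and backward fill leaves one if it ends with a missing value.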

Solution 4: 

A more robust approach is the use of machine learning algorithms to fill these missing data points. For example: 

  • Using k-nearest neighbors (KNN), first find the k instances closest to the one with the missing value, then fill the gap with the mean of that attribute across those neighbors.
  • Using regression, for each missing attribute, learn a regressor that can predict this missing value based on the other attributes.

It’s not easy to choose a specific technique to fill the missing values in our dataset, and the approach you use strongly depends on the problem you are working on and the type of missing value you have. 

This topic goes beyond the scope of this article, but keep in mind that we can have three different types of missing values, and each has to be treated differently: 

  • Type 1: Missing Completely at Random (MCAR)
  • Type 2: Missing at Random (MAR)
  • Type 3: Missing Not at Random (MNAR)

If you are familiar with Python, the sklearn library has helpful tools for this data preprocessing step, including the KNN Imputer I mentioned above.
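As a minimal sketch of the KNN Imputer on a made-up matrix with one missing value:

```python
import numpy as np
from sklearn.impute import KNNImputer

# A tiny matrix with one missing value in the second column
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [8.0, 8.0]])

# Fill the gap with the mean of that attribute
# across the 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the two nearest rows to `[2.0, NaN]` are `[1.0, 2.0]` and `[3.0, 4.0]`, so the missing value becomes the mean of 2.0 and 4.0.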

Structural Errors

Structural errors usually refer to some typos and inconsistencies in the values of the data. 

For example, say we run a marketplace that sells shoes on our website. Different sellers can describe the same product in different ways. Imagine that one of the attributes is the brand of the shoes, and for the same shoes we find: Nike, nike, NIKE. We need to fix this issue before giving the data to the model; otherwise, the model may treat them as different things. In this case, it’s an easy fix: just transform all the words to lowercase. Other scenarios may require more complex changes to fix inconsistencies and typos, though.

This issue generally requires manual intervention rather than applying some automated techniques.
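For a simple case like the brand example above, a sketch with pandas (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Nike", "nike", "NIKE ", "Adidas"]})

# Normalize case and strip stray whitespace so the three
# spellings of the same brand collapse into one value
df["brand"] = df["brand"].str.lower().str.strip()
```

After this step, the three variants of “Nike” are a single category.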

Dimensionality Reduction 

The dimensionality reduction is concerned with reducing the number of input features in training data. 

The Curse of Dimensionality in Your Dataset

With a real-world dataset, there are usually tons of attributes, and if we don’t reduce this number, it may affect the model’s performance later when we feed it this dataset. Reducing the number of features while keeping as much variation in the dataset as possible will have a positive impact in many ways, such as: 

  • Requiring less computational resources 
  • Increasing the overall performance of the model
  • Preventing overfitting (when the model becomes too complex and memorizes the training data instead of learning from it, so its performance on test data drops sharply)
  • Avoiding multicollinearity (high correlation between two or more independent variables). As a bonus, reducing dimensionality also reduces noise in the data.

Let’s dive into the main types of dimensionality reduction we can apply to our data to make it better for later use.

Feature Selection

Feature selection refers to the process of selecting the most important variables (features) related to your prediction variable; in other words, selecting the attributes that contribute most to your model. Here are some techniques for this approach that you can apply either automatically or manually:

  • Correlation Between Features: This is the most common approach, which drops one feature from each pair of highly correlated features, since they carry nearly the same information.
  • Statistical Tests: Another alternative is to use statistical tests to select the features, checking the relationship of each feature individually with the output variable. There are many examples in the scikit-learn library like SelectKBest, SelectPercentile, chi2, f_classif, f_regression
  • Recursive Feature Elimination (RFE): Also known as Backward Elimination, this algorithm trains the model with all features in the dataset, measures its performance, and then drops one feature at a time, stopping when the performance improvement becomes negligible.
  • Variance Threshold: Another feature selection method is the variance threshold, which removes all features whose variance doesn’t meet a given threshold. The premise of this approach is that features with low variability within themselves have little influence on the output variable.

Also, some models automatically apply a feature selection during the training. The decision-tree-based models can provide information about the feature importance, giving you a score for each feature of your data. The higher the value, the more relevant it is for your model. For more algorithms implemented in sklearn, consider checking the feature_selection module.
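As a sketch of the correlation approach from the list above, assuming pandas and synthetic data where one feature is nearly a duplicate of another (all names and the 0.9 threshold are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=100),  # nearly a duplicate of 'a'
    "c": rng.normal(size=100),
})

# Drop one feature from each pair with |correlation| above 0.9,
# scanning only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```

Here `b` is dropped because it’s almost perfectly correlated with `a`, while the independent feature `c` survives.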

Linear Methods

As the name suggests, the linear methods use linear transformations to reduce the dimensionality of the data.

The most common approach is Principal Component Analysis (PCA; for memory efficiency or sparse data, you may use IncrementalPCA or SparsePCA), a method that projects the original features into another dimensional space, capturing much of the original data’s variability with far fewer variables. However, the transformed features lose the interpretability of the original data, and PCA only works with quantitative variables.

Other types of linear methods are Factor Analysis and Linear Discriminant Analysis.
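A minimal sketch of PCA with sklearn on random data (the shapes and component count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# Project the 5 original features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

The `explained_variance_ratio_` attribute tells you how much of the original variability each component retains; in practice you can also pass a float like `n_components=0.95` to keep enough components for 95% of the variance.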

Non-Linear Methods

The non-linear methods (or manifold learning methods) are used when the data doesn’t fit in a linear space. The idea behind these techniques is that, in a high-dimensional space, most datasets actually lie on or near a small number of low-dimensional manifolds. Many algorithms make use of this approach.

The Multi-Dimensional Scaling (MDS) is one of those, and it calculates the distance between each pair of objects in a geometric space. This algorithm transforms the data to a lower dimension, and the pairs that are close in the higher dimension remain in the lower dimension as well. 

The Isometric Feature Mapping (Isomap) is an extension of MDS, but instead of Euclidean distance, it uses the geodesic distance. 

Other examples of non-linear methods are Locally Linear Embedding (LLE), Spectral Embedding, t-distributed Stochastic Neighbor Embedding (t-SNE). To learn more about this method and see all algorithms implemented in sklearn, you can check their page specifically about it.

Feature Engineering: Using Domain Knowledge to Create Features 

The feature engineering approach is used to create better features for your dataset that will increase the model’s performance. We mainly use domain knowledge to create those features, which we manually generate from the existing features by applying some transformation to them. Here are some basic examples you can easily apply to your dataset to potentially increase your model’s performance:

Decompose Categorical Attributes

The first example is decomposing categorical attributes from your dataset. Imagine that you have a feature in your data about hair color and the values are brown, blonde and unknown. In this case, you can create a new column called “has color” and assign 1 if you get a color and 0 if the value is unknown.
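A sketch of the hair color example with pandas (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["brown", "blonde", "unknown", "brown"]})

# New binary feature: 1 when a real color was recorded, 0 otherwise
df["has_color"] = (df["hair_color"] != "unknown").astype(int)
```

The new `has_color` column exposes the presence/absence pattern directly, which can be more useful to a model than the raw categories.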

Decompose a DateTime

Another example would be decomposing a datetime feature, which contains useful information, but it’s difficult for a model to benefit from the original form of the data. So if you think that your problem has time dependencies, and you may find some relationship between the datetime and the output variable, then spend some time trying to convert that datetime column into a more understandable feature for your model, like “period of day,” “day of the week,” and so on.
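A sketch of decomposing a datetime column with pandas, using made-up timestamps and feature names:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2023-01-02 08:30", "2023-01-07 21:15"])})

# Decompose the raw datetime into model-friendly features
df["day_of_week"] = df["timestamp"].dt.day_name()
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
```

Which derived features are worth keeping depends entirely on whether your target actually varies with them.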

Reframe Numerical Quantities

This last example is more about handling numerical data. Let’s say that you have a dataset about some purchases of clothes for a specific store. Besides the absolute number of purchases, you may find interest in creating new features regarding the seasonality of that purchase. So you may end up adding four more columns to your dataset about purchases in summer, winter, fall, and spring. Depending on the problem you are trying to solve it may help you and increase the quality of your dataset.

Therefore, this section is more about using your domain knowledge about the problem to create features that have high predictive power. If you want to learn more about this, here’s a great blog on feature engineering.

Handling a Large Amount of Data (Sampling Data)

Even though the more data you have, the greater the model’s accuracy tends to be, some machine learning algorithms can have difficulty handling a large amount of data and run into issues like memory saturation, computational increase to adjust the model parameters, and so on. To address this problem, here are some of the sampling data techniques we can use:

  • Sampling without replacement. This approach avoids having the same data repeated in the sample, so if the record is selected, it’s removed from the population.
  • Sampling with replacement. With this approach, the object is not removed from the population and can be repeated multiple times for the sample data since it can be picked up more than once.
  • Stratified sampling. This method is more complex and refers to splitting the data into many partitions and getting random samples for each partition. In cases where the classes are disproportional, this approach keeps the proportional number of classes according to the original data.
  • Progressive sampling. This last technique starts with a small size and keeps increasing the dataset until a sufficient sample size is acquired.
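The first three approaches above can be sketched with pandas, using a made-up dataset with a 90/10 class split:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100), "label": [0] * 90 + [1] * 10})

# Sampling without replacement: each row can appear at most once
without_repl = df.sample(n=20, replace=False, random_state=0)

# Sampling with replacement: rows may repeat in the sample
with_repl = df.sample(n=20, replace=True, random_state=0)

# Stratified sampling: take 10% of each class to keep the 90/10 ratio
stratified = df.groupby("label").sample(frac=0.1, random_state=0)
```

The stratified sample keeps 9 rows from the majority class and 1 from the minority class, preserving the original proportions.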

Data Transformation: Converting the Data to the Same Structure 

One of the most critical steps in the preprocessing phase is data transformation, which converts the data from one format to another. Some algorithms expect that the input data is transformed, so if you don’t complete this process, you may get poor model performance or even create bias. 

For example, the KNN model uses distance measures to compute the neighbors that are closer to a given record. If you have a feature whose scale is very high compared with other features in your model, then your model will tend to use more of this feature than the others, creating a bias in your model. Some of the main techniques used to deal with this issue are:  

Transformation for Categorical Variables

Categorical variables, usually expressed through text, are not directly used in most machine learning models, so it’s necessary to obtain numerical encodings for categorical features. The approach you use will depend on the type of variables. 

Ordinal Variables

Suppose you have ordinal qualitative data, which means that an order exists among the values (like small, medium, large). In that case, you need to apply a mapping function to replace each string with a number, like: {“small”: 1, “medium”: 2, “large”: 3}. You can use the OrdinalEncoder class in sklearn, which does that for you (LabelEncoder also encodes strings as numbers, but it’s intended for target labels).
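A sketch of the mapping approach with pandas (the column and ordering are the example’s, not a fixed convention):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Map each ordered category to a number that preserves the order
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)
```

Defining the dictionary yourself guarantees the numeric order matches the semantic order, which an automatic encoder won’t know unless you tell it.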

Nominal Variables

If you have nominal variables in your database, which means that there is no order among the values, you cannot apply the strategies you used with ordinal data. The most common technique used with this type of variable is the One Hot Encoding, which transforms one column into n columns (where n represents the unique values of the original column), assigning 1 to the label in the original column and 0 for all others. You may also come across people using get_dummies from pandas. 

For example, imagine a season column with four labels: Winter, Spring, Summer, and Autumn. Applying the one-hot encoding transforms it to season_winter, season_spring, season_summer and season_autumn. If you have a value of Summer assigned to season in your record, it will translate to season_summer 1, and the other three columns will be 0. 

When working with One Hot Encoding, you need to be aware of the multicollinearity problem. A simple solution is to remove one of the columns. Referring to the example above, if we have season_summer, season_spring, and season_autumn as 0, we know it’s winter.
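A sketch of the season example with pandas `get_dummies`, including the column drop that avoids multicollinearity:

```python
import pandas as pd

df = pd.DataFrame({"season": ["Summer", "Winter", "Spring"]})

# One column per unique value; drop_first removes one column
# because it's fully determined by the others (multicollinearity)
encoded = pd.get_dummies(df, columns=["season"], drop_first=True)
```

With `drop_first=True` the first category (alphabetically, Spring here) has no column of its own: a row with all zeros means Spring.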

Those were the main techniques for transforming qualitative data, so now let’s look at some of the methods for continuous data.

Min-Max Scaler / Normalization

The min-max scaler, also known as normalization, is one of the most common scalers and it refers to scaling the data between a predefined range (usually between 0 and 1). The main issue with this technique is that it’s sensitive to outliers, but it’s worth using when the data doesn’t follow a normal distribution. This method is beneficial for algorithms like KNN and Neural Networks since they don’t assume any data distribution.

Standard Scaler

The standard scaler is another widely used technique, known as z-score normalization or standardization. It transforms the data so that the mean is zero and the standard deviation is one. This approach works better with data that follows a normal distribution, and it’s less sensitive to outliers than the min-max scaler.
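Both scalers are one-liners in sklearn; a sketch on a toy single-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [9.0]])

minmax = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
standard = StandardScaler().fit_transform(X)  # mean 0, std 1
```

The min-max output is bounded to [0, 1], while the standardized output is centered at zero with unit spread; which one to use depends on your data’s distribution and the model downstream.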

Other Scalers

The min-max and standard scaler are the most common methods, but many different techniques may be helpful for your application, such as:

  • The MaxAbs scaler: This technique divides each record by the feature’s maximum absolute value, scaling the data to the range of -1 to 1.
  • The robust scaler: This technique removes the median from the data and scales it using the interquartile range (IQR). As the name suggests, it’s robust to outliers.
  • The power transformer scaler: This technique changes the data distribution, making it more like a normal distribution. It’s mostly used with heteroscedastic data, meaning data whose variables don’t all have the same variance.

Depending on the problem at hand, different scalers will help you improve your results. I’ve listed the most common options, but there are more you can find out there. 

Handling Data with Unequal Distribution of Classes (Imbalanced Data)

One of the most common problems we face when dealing with real-world data classification is that the classes are imbalanced (one of the classes has more examples than the other), creating a strong bias for the model. 

Imagine that you want to predict if a transaction is fraudulent. Based on your training data, 95% of your dataset contains records about normal transactions, and only 5% of your data is about fraudulent transactions. Based on that, your model most likely will tend to predict the majority class, classifying fraudulent transactions as normal ones. 

There are three main techniques that we can use to address this deficiency in the dataset:

  1. Oversampling 
  2. Undersampling 
  3. Hybrid 

Oversampling

The oversampling approach is the process of increasing your dataset with synthetic data of the minority class. The most popular technique used for this is the Synthetic Minority Oversampling Technique (SMOTE). Briefly, it takes a random example from the minority class. Then another random data point is selected through k-nearest neighbors of the first observation, and a new record is created between these two selected data points. You can find this technique in the imbalanced-learn library in Python.
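In practice you’d use `SMOTE` from imbalanced-learn directly; to illustrate just the interpolation idea described above, here is a simplified numpy-only sketch (not the full algorithm, and the minority points and function name are made up):

```python
import numpy as np

def make_synthetic(minority, k=2, rng=None):
    """Create one synthetic point the SMOTE way (simplified):
    interpolate between a random minority point and one of its
    k nearest minority neighbors."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(minority))
    point = minority[i]
    # distances from the chosen point to every minority point
    dists = np.linalg.norm(minority - point, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    j = rng.choice(neighbors)
    gap = rng.random()  # random spot on the segment between the two points
    return point + gap * (minority[j] - point)

minority = np.array([[1.0, 1.0], [2.0, 2.0], [1.5, 1.8], [2.2, 1.1]])
synthetic = make_synthetic(minority)
```

Because each synthetic point lies on the segment between two real minority points, it stays inside the region the minority class already occupies.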

Undersampling

The undersampling technique, in contrast, is the process of reducing your dataset and removing real data from your majority class. The main algorithms used in this approach are the TomekLinks, which removes the observation based on the nearest neighbor, and the Edited Nearest Neighbors (ENN), which uses the k-nearest neighbor instead of only one as in Tomek.

Hybrid

The hybrid approach combines the oversampling and undersampling techniques in your dataset. One of the algorithms that are used in this method is the SMOTEENN, which makes use of the SMOTE algorithm for oversampling in the minority class and ENN for undersampling in the majority class.

The Data Preprocessing Pipeline

Although it isn’t possible to establish a rule for the data preprocessing steps for our machine learning pipeline, in general, what I use and what I’ve come across is the following flow of data preprocessing operations:

Diagram of the Data Preprocessing pipeline
  1. Step 1: Start by analyzing and treating the correctness of attributes, like identifying noise data and any structural error in the dataset.
  2. Step 2: Analyze missing data, along with the outliers, because filling missing values depends on the outliers analysis. After completing this step, go back to the first step if necessary, rechecking redundancy and other issues.
  3. Step 3: Add domain knowledge to create new features for your dataset. If you can’t come up with any useful new features for your project, don’t worry, and avoid creating useless ones.
  4. Step 4: Use this step for transforming the features into the same scale/unit. If you doubt that the model you will be using needs the data on the same scale, then apply it. It won’t negatively affect the models that don’t need data transformation.
  5. Step 5: This stage avoids the curse of dimensionality, so if you think you’re having this problem, you must apply this step in your pipeline. It comes after data transformation because some of the techniques (e.g., PCA) need transformed data.
  6. Step 6: The last part before moving to the model phase is to handle the imbalanced data. Also, there are some specific metrics for calculating the model’s performance when you have this issue in your data.

I didn’t mention the sampling data step above, and the reason is that I encourage you to try all data you have. If you have a large amount of data and can’t handle it, consider using the approaches from the data sampling phase. 

It’s important to note that this may not always be the exact order you should follow, and you may not apply all of these steps in your project, and it will entirely depend on your problem and the dataset. 

Conclusion

The data preprocessing phase is crucial for determining the correct input data for the machine learning algorithms. As we saw previously, without applying the proper techniques, you can have a worse model result. For example, the k-nearest neighbors algorithm is affected by noisy and redundant data, is sensitive to different scales, and doesn’t handle a high number of attributes well. If you use this algorithm, you must clean the data, avoid high dimensionality and normalize the attributes to the same scale. 

However, if you use a Decision Tree algorithm, you don’t need to worry about normalizing the attributes to the same scale. Thus, each model has its own peculiarity, and you need to know beforehand to give a proper data input to the model. With that said, now you can move forward to the model exploration phase and know those peculiarities of the algorithms.

One last important thing to remember, which is a common mistake in this field: you need to split your dataset into training and test sets before applying some of these techniques, learning the parameters only from the training set and then applying them to the test set. For those already familiar with Python and sklearn: apply the fit and transform methods to the training data, and only the transform method to the test data.
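A sketch of that fit/transform discipline with a scaler (the data here is an arbitrary toy array):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Calling `fit` (or `fit_transform`) on the test set would leak information from the test data into the preprocessing step and inflate your evaluation.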

