
Photo by DeepMind on Unsplash
ML teams developing new models or algorithms naturally expect the model to perform well on test data.
But very often it doesn’t.
There can be many reasons, but the main culprits are:
- Lack of sufficient data
- Bad quality data
- Overfitting
- Underfitting
- Wrong choice of algorithm
- Poor hyperparameter tuning
- Dataset bias
The above list, however, is not exhaustive.
In this article, we will discuss a process that can address many of the problems above, and one that ML teams should carry out with great care: data preprocessing.
It is widely accepted in the machine learning community that data preprocessing is an important step in the ML workflow and can improve model performance.
There are many studies and articles that have shown the importance of data preprocessing in machine learning, such as:
“A study by Bezdek et al. (1984) showed that data preprocessing improves the accuracy of several clustering algorithms by up to 50%.”
“A study by Chollet (2018) showed that data pre-processing techniques such as data normalization and data augmentation can improve the performance of deep learning models.”
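As a rough illustration of that second point, here is a minimal sketch of input normalization and simple image augmentation using Keras preprocessing layers; the specific layers and parameters are assumptions chosen for the example, not taken from the cited work.
import tensorflow as tf
# Normalization and augmentation as reusable preprocessing layers
# (an assumed example setup, not the cited study's configuration)
augment_and_normalize = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),      # normalize pixel values to [0, 1]
    tf.keras.layers.RandomFlip("horizontal"),   # augmentation: random horizontal flips
    tf.keras.layers.RandomRotation(0.1),        # augmentation: small random rotations
])
# This stack can then be placed in front of a Keras model so that
# augmentation is applied on the fly during training.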
It is also worth noting that preprocessing techniques are important not only to improve model performance, but also to make the model more interpretable and robust.
For example, handling missing values, removing outliers, and scaling data can help prevent overfitting, which can lead to models that generalize better to new data.
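As a quick illustration, missing values do not always have to be dropped; they can be imputed instead. Here is a minimal sketch using scikit-learn’s SimpleImputer on a small, hypothetical dataframe:
import pandas as pd
from sklearn.impute import SimpleImputer
# A tiny hypothetical dataframe with missing numeric values
df = pd.DataFrame({"alcohol": [9.4, None, 10.1], "chlorides": [0.08, 0.07, None]})
# Replace each missing value with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)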
In any case, it is important to note that the specific preprocessing techniques and the amount of preprocessing required for a given dataset will depend on the nature of the data and the specific requirements of the algorithm.
It is also important to remember that in some cases data preprocessing may be unnecessary or even detrimental to the performance of the model.
Preprocessing data before applying it to a machine learning (ML) algorithm is an important step in the ML workflow.
This step helps ensure that the data is in a format that the algorithm can understand and that it is free of errors or outliers that could negatively affect the performance of the model.
In this article, we will discuss some of the benefits of data preprocessing and provide a code example of how to preprocess data using the popular Python library, Pandas.
One of the main benefits of data preprocessing is that it helps improve model accuracy. By cleaning and formatting the data, we can ensure that the algorithm only considers relevant information and that it is not influenced by any irrelevant or incorrect data.
This can lead to a more accurate and robust model.
Another benefit of data preprocessing is that it can help reduce the time and resources required to train a model. By removing irrelevant or redundant data, we can reduce the amount of data the algorithm needs to process, which can significantly reduce the amount of time and resources required to build the model.
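As a rough sketch of this idea (using hypothetical column names), an identifier column that carries no predictive signal and a feature that merely duplicates another can be dropped before training:
import numpy as np
import pandas as pd
# Hypothetical dataset with an irrelevant ID column and a redundant feature
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "alcohol": [9.4, 9.8, 10.1, 11.2],
    "alcohol_pct": [9.4, 9.8, 10.1, 11.2],  # duplicate of "alcohol"
    "quality": [5, 6, 6, 7],
})
# Drop the column that carries no predictive information
data = data.drop(columns=["id"])
# Drop features that are almost perfectly correlated with another feature
corr = data.drop(columns=["quality"]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
data = data.drop(columns=redundant)
print(data.columns.tolist())  # ['alcohol', 'quality']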
Data preprocessing can also help prevent overfitting. Overfitting occurs when a model is trained on a dataset that is too specific, and as a result it performs well on the training data but poorly on new, unseen data.
By preprocessing the data and removing irrelevant or redundant information, we can help reduce the risk of overfitting and improve the model’s ability to generalize to new data.
Data preprocessing can also improve model interpretability. By cleaning and formatting the data, we can make it easier to understand the relationships between different variables and how they affect the model’s predictions.
This can help us better understand the behavior of the model and make more informed decisions about how to improve it.
Now let’s see an example of data preprocessing using Pandas. We will use a dataset that contains information about wine quality. The dataset has several features, such as alcohol, chlorides, and density, and a target variable: wine quality.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load the data
data = pd.read_csv("winequality.csv")
# Check for missing values
print(data.isnull().sum())
# Drop rows with missing values
data = data.dropna()
# Check for duplicate rows
print(data.duplicated().sum())
# Drop duplicate rows
data = data.drop_duplicates()
# Check for outliers using the interquartile range (IQR)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data = data[
    ~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)
]
# Separate the features from the target so the target itself is not scaled
X = data.drop("quality", axis=1)
y = data["quality"]
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
In this example, we first load the data using the read_csv function from Pandas and then check for missing values using the isnull function. Then we remove rows with missing values using the dropna function.
Next, we check for duplicate rows using the duplicated function and remove them using the drop_duplicates function.
We then check for outliers using the interquartile range (IQR) method, where the IQR is the difference between the third and first quartiles (Q3 - Q1). Any row containing a value more than 1.5 times the IQR below the first quartile or above the third quartile is considered an outlier and removed from the dataset.
After handling missing values, duplicate rows, and outliers, we separate the features from the target variable and scale the features using the StandardScaler class from the sklearn.preprocessing library. Scaling is important because it helps ensure that all variables are on the same scale, which many machine learning algorithms need in order to work properly.
Finally, we split the data into training and test sets using the train_test_split function from the sklearn.model_selection library. This step is necessary to evaluate the performance of the model on unseen data.
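As a small follow-up (an assumption for illustration, not part of the original example), the preprocessed splits could be used to train and evaluate a simple baseline model:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train a simple baseline model on the preprocessed training split
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Evaluate on the held-out test split
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))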
Not pre-processing data before applying it to a machine learning algorithm can have several negative consequences. Some of the main problems that can occur are:
- Poor model performance. If the data is not properly cleaned and formatted, the algorithm may not be able to understand it correctly, which can lead to poor model performance. This can be caused by missing values, outliers, or inconsistent data that have not been removed from the data set.
- Overfitting. If the dataset is not cleaned and preprocessed, it may contain irrelevant or redundant information, which may lead to overfitting. Overfitting occurs when a model is trained on a dataset that is too specific, and as a result it performs well on the training data but poorly on new, unseen data.
- Longer training times. Without preprocessing, the algorithm may have to process more data than necessary, which can significantly increase the time required to train the model.
- Difficulty understanding the model. If the data is not pre-processed, it can be difficult to understand the relationships between different variables and how they affect the model’s predictions. This can make it difficult to spot model errors or areas for improvement.
- Biased results. If the data is not pre-processed, it may contain errors or biases that may lead to unfair or inaccurate results. For example, if the data contains missing values, the algorithm may be working with a biased sample of data, which may lead to incorrect conclusions.
In general, skipping data preprocessing can lead to models that are less accurate, less interpretable, and more difficult to work with. Data preprocessing is an important step in the machine learning workflow that should not be skipped.
In conclusion, preprocessing data before applying it to a machine learning algorithm is an important step in the ML workflow. It helps improve accuracy, reduce the time and resources required to train a model, prevent overfitting, and improve model interpretability.
The above code example shows how to preprocess data using the popular Python library Pandas, but there are many other libraries available for data preprocessing, such as NumPy and Scikit-learn, which can be used depending on the specific needs of your project.
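For instance, here is a minimal sketch of how scikit-learn’s Pipeline could bundle imputation and scaling into one reusable preprocessing step for the same wine-quality data; the particular steps are assumptions for illustration, not the only reasonable setup:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Reload the raw data so the pipeline can handle missing values itself
data = pd.read_csv("winequality.csv")
X = data.drop("quality", axis=1)
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Imputation and scaling are fit on the training split only,
# then the same transformation is applied to the test split
preprocessing = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = preprocessing.fit_transform(X_train)
X_test_prepared = preprocessing.transform(X_test)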
Sumit Singh is a serial entrepreneur working towards Data Centric AI. He co-founded Labellerr, a next-generation training data platform. Labellerr’s platform allows AI-ML teams to easily automate their data preparation pipeline.