Introduction To Data Pre-processing in Data Science
Data preprocessing in data science is a crucial step that improves the quality of data and promotes the extraction of meaningful insights from it. Data preprocessing refers to the technique of cleaning and organizing raw data so that it is suitable for building and training machine learning models; in other words, it transforms raw data into an informative and readable format.
What is data preprocessing in data science and why is it required?
Real-world data is often very raw: it contains incomplete information, inconsistent values, missing values, and inaccurate data such as outliers. That is why data preprocessing is the first step in any data analysis process and in building any machine learning model; data preprocessing in data science helps us organize raw data.
Data preprocessing techniques in data science
- Missing Value Treatment
- Outlier Treatment
- Dealing With Categorical Data
- Scaling and Transformation
- Splitting DataSet
Missing Value Treatment
Data contains missing values for many reasons, such as observations that were never recorded or data corruption. When your data contains missing values, we cannot get the right analysis out of it, and many machine learning algorithms do not support missing values. That is the reason for missing value treatment.
There are two important processes for handling missing values in pandas.
- Dropping
- Imputation of null value
Dropping missing values
When a column contains more than 50% null values, the best way to handle the missing values is usually to drop them, because the non-null values that are present do not give us enough information to fill in the gaps. Dropping is also a good method when only a small number of rows or columns contain null values. To drop missing values, use the dropna() function.
Syntax for dropping null values:
DataFrame.dropna(axis=0/1, how='all'/'any', subset=['column name'], thresh=n)
Where,
axis = 0 -> checks for null values along rows
axis = 1 -> checks for null values along columns
how = 'all' -> drops a row or column only if all of its values are null
how = 'any' -> drops a row or column if it contains any null value
thresh = keeps a row or column only if it contains at least that many non-null values. For example, thresh = 2 keeps a row or column only if it has at least 2 non-null values.
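A minimal sketch of how these parameters can be used on a small pandas DataFrame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "salary": [50000, 60000, np.nan, np.nan],
    "city": ["Pune", None, "Mumbai", "Delhi"],
})

df.dropna()                   # drop rows that contain any null value
df.dropna(axis=1)             # drop columns that contain any null value
df.dropna(how="all")          # drop rows only if every value is null
df.dropna(subset=["salary"])  # drop rows where the 'salary' column is null
df.dropna(thresh=2)           # keep only rows with at least 2 non-null values
```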
Imputation of missing value
Sometimes, rather than dropping missing values, you would rather replace them with a valid value. Dropping is not always good for every problem statement, because you can lose useful insights contained in the other columns or rows. A better way is to fill the null values with a single number such as zero, or with the mean, median, or mode. To fill null values in pandas, use fillna().
Important ways to fill null values using the mean, median, and mode:
Mean - used when your data is not skewed (i.e. normally distributed)
Median - used when your data is skewed (i.e. not normally distributed)
Mode - used when your data is skewed (i.e. not normally distributed); mostly used for filling categorical null values
Syntax:-
fillna(value, method='ffill'/'bfill', axis=0/1)
method = 'ffill' -> fills null values in the forward direction
method = 'bfill' -> fills null values in the backward direction
axis = 0 -> fills null values down each column (along the index)
axis = 1 -> fills null values across each row (along the columns)
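A small sketch of these filling strategies on the same kind of hypothetical DataFrame (column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "salary": [50000, 60000, np.nan, 52000],
    "city": ["Pune", None, "Mumbai", "Delhi"],
})

# Fill with a statistic appropriate to each column's distribution
df["age"] = df["age"].fillna(df["age"].mean())             # mean: data not skewed
df["salary"] = df["salary"].fillna(df["salary"].median())  # median: skewed data
df["city"] = df["city"].fillna(df["city"].mode()[0])       # mode: categorical data

# Fill by propagating neighbouring values
# (equivalent to fillna(method='ffill'/'bfill') in the syntax above)
df_forward = df.ffill()   # forward fill
df_backward = df.bfill()  # backward fill
```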
Outlier Treatment
Outliers are values that lie far outside the rest of the data. If the data contains outliers, the data is skewed by extremely large or small values in its columns, so any data analysis performed on it can go in the wrong direction. Outlier treatment is the process used to solve this problem.
There are two common outlier treatment techniques.
- Interquartile range (IQR)
- Z-Score
Interquartile Range
Quartiles divide the distribution into four equal parts.
- 25% of the data lies below the 1st quartile (Q1),
- 75% lies below the 3rd quartile (Q3), and
- the middle one is the 2nd quartile (Q2); the interquartile range leaves out the extreme values.
How to calculate Interquartile range
The 2nd quartile (Q2) divides the distribution into two equal parts of 50%, so it is basically the same as the median. The interquartile range is the distance between the third and the first quartile; in other words, IQR equals Q3 minus Q1.
Formula:- IQR = Q3 - Q1
Identify Outliers Using the IQR Method
As per a rule of thumb, observations can be qualified as outliers when they lie more than 1.5 IQR below the first quartile or 1.5 IQR above the third quartile. Outliers are values that “lie outside” the other values.
Lower bound = Q1 - 1.5 * IQR
Upper bound = Q3 + 1.5 * IQR
Values below the lower bound or above the upper bound are treated as outliers.
Advantage of IQR
- The main advantage of the IQR is that it is not affected by outliers because it doesn’t take into account observations below Q1 or above Q3.
- It might still be useful to look for possible outliers in your study.
Outliers are usually visualized with the help of a box plot.
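A minimal sketch of flagging outliers with the IQR rule on a single numeric column (the 'salary' values below are made up for illustration):

```python
import pandas as pd

salary = pd.Series([42000, 45000, 47000, 50000, 52000, 55000, 250000])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are treated as outliers
outliers = salary[(salary < lower_bound) | (salary > upper_bound)]
print(outliers)  # the extreme 250000 value is flagged
```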
Z-Score
A z-score is the number of standard deviations a data point lies from the mean.
Formula:
Z Score = (x - μ) / σ
x: Value of the element
μ: Population mean
σ: Standard Deviation
Note:- A z-score of zero tells you the value is exactly average, while a score of +3 tells you that the value is much higher than average.
Bell Shape Distribution and Empirical Rule: If the distribution is bell shaped, then about 68% of the elements have a z-score between -1 and 1, about 95% have a z-score between -2 and 2, and about 99.7% have a z-score between -3 and 3.
So if the z-score of a value is less than -3 or greater than +3, that value is considered an outlier.
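A small sketch of the same idea using z-scores (the salary values are generated artificially, with one extreme entry injected):

```python
import numpy as np
import pandas as pd

# Hypothetical salary column: 200 typical values plus one extreme entry
rng = np.random.default_rng(0)
salary = pd.Series(np.append(rng.normal(50000, 5000, 200), 250000))

z_scores = (salary - salary.mean()) / salary.std()

# Values with |z| > 3 are treated as outliers
outliers = salary[z_scores.abs() > 3]
print(outliers)  # only the injected 250000 entry is flagged
```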
Dealing With Categorical Data
Most statistical analysis depends on mathematical calculations, so when our data is categorical it is a problem to compute these mathematical terms or to pass the data into a machine learning model, which requires numerical input. It is therefore an important step to convert categorical data into numerical data before performing any analysis.
The following methods transform categorical data into numerical values.
- Label Encoding
- One Hot Encoding/Dummy variable.
Label Encoding
Label encoding converts categorical data into numerical data by assigning each categorical value a number, starting from zero.
Example:- Consider the categorical columns of the bridge dataset. After applying the label encoder, the values of the bridge-type column are converted into numerical format, as in the sketch below.
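A minimal sketch using scikit-learn's LabelEncoder on a hypothetical bridge-type column (the category names are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "bridge_type": ["Arch", "Beam", "Truss", "Arch", "Suspension", "Beam"]
})

# Each category is mapped to an integer starting from zero
encoder = LabelEncoder()
df["bridge_type_encoded"] = encoder.fit_transform(df["bridge_type"])
print(df)
```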
Problem with the label encoder
The problem with using numbers is that they introduce a relation/comparison between the categories.
The algorithm might misunderstand the data as having some kind of hierarchy/order, 0 < 1 < 2 … < 6, and might give the category encoded as 6 six times more weight in its calculations.
One Hot Encoding
In One Hot Encoding, each category value is converted into a new column and assigned a 1 or 0 (notation for true/false) value to the column.
Applying one-hot encoding to the same bridge-type column gives one new column per category, as sketched below.
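A short sketch of one-hot encoding the same hypothetical column with pandas get_dummies:

```python
import pandas as pd

df = pd.DataFrame({
    "bridge_type": ["Arch", "Beam", "Truss", "Arch", "Suspension", "Beam"]
})

# Each category becomes its own 0/1 column
one_hot = pd.get_dummies(df, columns=["bridge_type"])
print(one_hot)
```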
Scaling and Transformation
Most machine learning algorithms take into account only the magnitude of the measurements, not their units. So a feature expressed in a very high magnitude (large numbers) may affect the prediction a lot more than an equally important feature expressed in smaller numbers.
Example:- you have two lengths, l1 = 250 cm and l2 = 2.5 m. We humans see that these are identical lengths (l1 = l2), but most ML algorithms interpret them quite differently.
Consider a dataset that contains an Age and a Salary column. Suppose you want to train a machine learning model on this data; the model will struggle because the two columns have very different ranges, so feature scaling is required.
The following are two ways to perform feature scaling.
- Standardization
- Normalization
Standardization
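Standardization (z-score scaling) rescales a feature so that it has a mean of zero and a standard deviation of one:
x_scaled = (x - μ) / σ
where μ is the mean and σ is the standard deviation of the feature.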
Normalization
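Normalization (min-max scaling) rescales a feature to a fixed range, usually 0 to 1:
x_scaled = (x - x_min) / (x_max - x_min)
A minimal sketch of both techniques using scikit-learn, assuming Age and Salary columns like those described above (the values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Salary": [48000, 54000, 90000, 110000, 61000],
})

# Standardization: zero mean, unit variance per column
df_standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Normalization: rescale each column to the range [0, 1]
df_normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
```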
Splitting DataSet
Before applying machine learning models, we should split the data into two parts: a training set and a test set. If we use 100% of the gathered data (the full dataset) to train the model, we will have no data left for testing the accuracy of the model we have built. So we generally split the dataset in a 70:30 or 80:20 ratio (training set : test set). Special care needs to be taken in splitting the data.
Training Data
The Machine Learning model is built using the training data. The training data helps the model to identify key trends and patterns essential to predict the output.
Testing Data
After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is done by the testing data set.
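A minimal sketch of an 80:20 split using scikit-learn's train_test_split (the feature and target column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 55, 36, 41],
    "salary": [48000, 54000, 90000, 110000, 61000, 52000, 85000, 120000, 58000, 72000],
    "purchased": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

X = df[["age", "salary"]]  # features
y = df["purchased"]        # target

# 80:20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```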