Introduction: Null Value Treatment in Python
While coding in Python, it is very common to assign or initialize variables with string, float, or integer values. But sometimes you may want to assign a null value to a variable; handling such values is called null value treatment in Python. Unlike other programming languages such as PHP, Java, or C, Python does not have a null value. Instead, there is the ‘None’ keyword that you can use to define a null value.
Real-world data is rarely clean and homogeneous. In particular, many interesting datasets have some amount of data missing.
In this article, we will discuss some general considerations for missing data (i.e. null values) and how Pandas chooses to represent it.
Identify NaN and None in Pandas
NaN and None both represent a null value, and Pandas is built to handle the two of them nearly interchangeably. The following example shows how they are interchanged.
Example:-
import numpy as np
import pandas as pd

pd.Series([1, np.nan, 2, None])
Output:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan or None, it will automatically be upcast to a floating-point type.
x = pd.Series(range(2), dtype=int)
print(x)
output:
0 0
1 1
dtype: int64
x[0] = None
print(x)
Output:
0 NaN
1 1.0
dtype: float64
Now you can see that in addition to casting the integer array to floating-point, Pandas automatically converts the None to a NaN value.
The following table lists the Pandas casting conventions when NA values are introduced:
Typeclass | Promotion when storing NAs | NA sentinel value
Floating | No change | np.nan
Object | No change | np.nan or None
Integer | Cast to float64 | np.nan
Boolean | Cast to object | np.nan or None
Always remember that in Pandas, string data is always stored with an object dtype.
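These casting conventions are easy to verify directly. A quick sketch with illustrative values:

```python
import numpy as np
import pandas as pd

# An all-integer list yields int64...
print(pd.Series([1, 2, 3]).dtype)            # int64
# ...but one NaN forces an upcast to float64
print(pd.Series([1, np.nan, 3]).dtype)       # float64

# An all-boolean list yields bool...
print(pd.Series([True, False]).dtype)        # bool
# ...but one None forces a fallback to object
print(pd.Series([True, None, False]).dtype)  # object
```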
Detecting Null Values
As we saw in the example above, Pandas treats None and NaN as indicating missing or null values. There are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
Pandas data structures have two useful methods for detecting null data:
- isnull(): checks for any null value
- notnull(): the opposite of isnull()
For Example:-
df = pd.Series([1, 2, np.nan, 'fireblaze', None])
# isnull()
df.isnull()
Output:
0    False
1    False
2     True
3    False
4     True
dtype: bool
# notnull()
df[df.notnull()]
Output:
0            1
1            2
3    fireblaze
dtype: object
# To count the number of null values, use .isnull().sum()
df.isnull().sum()
# Output:-
2
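The same methods work on a whole DataFrame, where isnull().sum() gives a per-column count of missing values. A small sketch (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': ['x', None, 'z'],
    'c': [1.0, 2.0, 3.0],
})

# isnull() returns a boolean DataFrame; sum() counts the Trues per column
print(df2.isnull().sum())
# a    1
# b    1
# c    0
# dtype: int64

# Chain a second sum() for the total number of missing cells
print(df2.isnull().sum().sum())  # 2
```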
Null Value Treatment in Python
Data can contain null values for many reasons, such as observations not being recorded or data corruption. When your data contains null values, you don't get the right analysis of your data, and many machine learning algorithms don't support missing values. That is the reason for handling missing values.
There are two important processes for handling missing values:
- Dropping
- Imputation of null values
Dropping Missing Values
Suppose a column contains more than 60%–70% missing values. In that case we prefer to drop the whole column, because dropping every row with a null value there would also throw away the valid values in the rest of the columns, causing data loss. If instead a column contains only a limited number of null values (and the related columns contain nulls in the same rows), drop those rows using the Pandas dropna() function.
Syntax:-
DataFrame.dropna(axis=0/1, how='all'/'any', subset=['column name'], thresh=<number>)
Where,
axis=0 -> drops rows that contain null values
axis=1 -> drops columns that contain null values
how='all' -> drops a row or column only if all of its values are null
how='any' -> drops a row or column if it contains any single null value
thresh -> the minimum number of non-null values a row/column must contain to be kept, e.g. thresh=2 keeps only rows/columns with at least two non-null values
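To illustrate the subset and thresh parameters described above, here is a small sketch (the column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['a', 'b', 'c', 'd'],
    'score': [10.0, np.nan, 30.0, np.nan],
    'grade': ['A', 'B', np.nan, np.nan],
})

# subset: only look at 'score' when deciding which rows to drop
kept_subset = df.dropna(subset=['score'])
print(len(kept_subset))  # 2 rows have a non-null score

# thresh=2: keep rows with at least two non-null values
kept_thresh = df.dropna(thresh=2)
print(len(kept_thresh))  # 3 rows (the last row has only one non-null value)
```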
Example:-
df = pd.Series([1, 2, np.nan, 'fireblaze', None])
# Drop all null values from the series
df.dropna()
Output:-
0            1
1            2
3    fireblaze
dtype: object
Drop null values in a DataFrame:-
A DataFrame offers more options. Let's create one.
Example:-
df = pd.DataFrame([[1, np.nan, 2],
                   [3, 4, 5],
                   [np.nan, 6, 7]])
print(df)
  | 0 | 1 | 2
0 | 1.0 | NaN | 2.0
1 | 3.0 | 4.0 | 5.0
2 | NaN | 6.0 | 7.0
We cannot drop single values from a DataFrame; we can only drop full rows or full columns, depending on the problem statement. So dropna() gives a number of options for a DataFrame.
By default, dropna() will drop every row that contains any null value.
For Example:
df.dropna()
  | 0 | 1 | 2
1 | 3.0 | 4.0 | 5.0
Another Method:
You can drop NA values along a different axis; axis=1 drops all columns containing a null value:
Example:-
df.dropna(axis = 1)
  | 2
0 | 2.0
1 | 5.0
2 | 7.0
Another interesting option is dropping rows or columns with all NA values. This can be specified through the ‘how’ or ‘thresh’ parameters, which allow fine control of the number of nulls to allow through.
The default is how='any', so any row or column containing a null (NaN) value will be dropped. You can also specify how='all', which will only drop rows/columns whose values are all null.
Now, add a column of all-NaN values to the DataFrame.
Example:-
df[3] = np.nan
print(df)
  | 0 | 1 | 2 | 3
0 | 1.0 | NaN | 2.0 | NaN
1 | 3.0 | 4.0 | 5.0 | NaN
2 | NaN | 6.0 | 7.0 | NaN
df.dropna(axis='columns', how='all')  # drop the columns where all values are NaN
  | 0 | 1 | 2
0 | 1.0 | NaN | 2.0
1 | 3.0 | 4.0 | 5.0
2 | NaN | 6.0 | 7.0
Let’s use the ‘thresh’ parameter: you specify the minimum number of non-null values a row/column must contain to be kept.
Example:-
df.dropna(axis='rows', thresh=3)
  | 0 | 1 | 2 | 3
1 | 3.0 | 4.0 | 5.0 | NaN
Here the first and last rows have been dropped: ‘thresh=3’ requires at least three non-null values per row, and those two rows contain only two non-null values each.
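The thresh parameter works along columns as well; a brief sketch on the same shape of data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [3, 4, 5, np.nan],
                   [np.nan, 6, 7, np.nan]])

# Keep only columns that have at least two non-null values;
# column 3 (all NaN) is dropped, the others survive
result = df.dropna(axis='columns', thresh=2)
print(list(result.columns))  # [0, 1, 2]
```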
Filling null values
Sometimes, rather than dropping NA values, you would rather replace them with a valid value. Dropping is not good for every problem statement, because the other columns or rows may carry useful insight. A better way is to fill the null values, and this is also part of null value treatment in Python. The fill value might be a single number like zero, or the mean, median, or mode. You could do this in place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the data with the null values replaced.
Generally, we fill null values in numerical data using the mean or median, and use the mode for categorical data.
When to use which measure of central tendency to fill null values?
- Mean: used when your data is not skewed (i.e. normally distributed)
- Median: used when your data is skewed (i.e. not normally distributed)
- Mode: used when your data is skewed; mostly used for filling categorical null values
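One practical way to apply this guidance is to check the skewness before choosing a fill value. A sketch with illustrative numbers, where the outlier 200 skews the data:

```python
import pandas as pd

# A numeric column with one outlier and one missing value
s = pd.Series([10, 12, 11, 13, 200, None])

print(s.skew())    # strongly positive -> skewed distribution
print(s.mean())    # 49.2, pulled up by the outlier
print(s.median())  # 12.0, robust to the outlier

# For skewed data, fill with the median rather than the mean
filled = s.fillna(s.median())
print(filled.isnull().sum())  # 0
```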
Syntax:-
fillna(value, method='ffill'/'bfill', axis=0/1)
Where,
method='ffill' -> fills null values in the forward direction
method='bfill' -> fills null values in the backward direction
axis=0 -> fills null values down each column
axis=1 -> fills null values across each row
For Example:-
df = pd.Series([1, np.nan, 2, None, 3, None, 4], index=list('abcdefg'))
print(df)
a 1.0
b NaN
c 2.0
d NaN
e 3.0
f NaN
g 4.0
dtype: float64
#Fill NA values with zero.
df.fillna(0)
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
f 0.0
g 4.0
dtype: float64
# Fill the values using a forward fill: each NA value is replaced by the last valid value before it.
Example:-
# forward-fill
df.fillna(method='ffill')
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
f 3.0
g 4.0
dtype: float64
Another method is a back-fill, which propagates the next valid value backward.
Example:-
# back-fill
df.fillna(method='bfill')
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
f 4.0
g 4.0
dtype: float64
Similarly, ffill and bfill apply to a DataFrame.
So, create a DataFrame.
Example:-
df = pd.DataFrame([[1, np.nan, 2 , np.nan],
[3, 4, 5,np.nan],
[np.nan, 6, 7,np.nan],
])
print(df)
  | 0 | 1 | 2 | 3
0 | 1.0 | NaN | 2.0 | NaN
1 | 3.0 | 4.0 | 5.0 | NaN
2 | NaN | 6.0 | 7.0 | NaN
#ffill
df.fillna(method='ffill', axis=1)
  | 0 | 1 | 2 | 3
0 | 1.0 | 1.0 | 2.0 | 2.0
1 | 3.0 | 4.0 | 5.0 | 5.0
2 | NaN | 6.0 | 7.0 | 7.0
Note: if a previous value is not available during a forward fill, the NA value remains.
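When that leftover NA matters, a common trick is to chain a forward fill with a backward fill so every gap gets covered. A small sketch, using the .ffill()/.bfill() shorthand methods (which behave like fillna with method='ffill'/'bfill'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1.0, np.nan],
                   [2.0, np.nan, 3.0]])

# Forward fill along each row leaves the leading NaN in row 0...
step1 = df.ffill(axis=1)
# ...so a backward fill afterwards covers any NaN with no value before it
step2 = step1.bfill(axis=1)
print(step2.isnull().sum().sum())  # 0 -> no missing values remain
```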
Fill null value using the mean of a particular column
Create the data frame:
df = pd.DataFrame([[ 0, 1, np.nan, 3, 4],
[ 5, np.nan, 7, 8, 9],
[10, 11, 12, 13, np.nan],
[15, np.nan, 17, 18, 19],
[20, 21, 22, np.nan, np.nan]])
df
Output:-
0 1 2 3 4
0 0 1.0 NaN 3.0 4.0
1 5 NaN 7.0 8.0 9.0
2 10 11.0 12.0 13.0 NaN
3 15 NaN 17.0 18.0 19.0
4 20 21.0 22.0 NaN NaN
Example:-
# Check the mean value of column 4
mean_value = df[4].mean()
mean_value
output:-
10.666666666666666
Pass the mean-value variable into the fillna() function to fill the null values of that particular column with its mean:
df[4].fillna(mean_value,inplace= True) # inplace = True for original change in dataframe
df
Output:-
0 1 2 3 4
0 0 1.0 NaN 3.0 4.000000
1 5 NaN 7.0 8.0 9.000000
2 10 11.0 12.0 13.0 10.666667
3 15 NaN 17.0 18.0 19.000000
4 20 21.0 22.0 NaN 10.666667
# Alternative to the above code: fill the null value using the mean directly
# df[4].fillna(df[4].mean(), inplace=True)  # inplace=True for changing the original dataframe
Now fill the null values of column 1 using the median of the data:
Example:-
df[1].fillna(df[1].median(),inplace = True)
df
Output:-
0 1 2 3 4
0 0 1.0 NaN 3.0 4.000000
1 5 11.0 7.0 8.0 9.000000
2 10 11.0 12.0 13.0 10.666667
3 15 11.0 17.0 18.0 19.000000
4 20 21.0 22.0 NaN 10.666667
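The mode, mentioned earlier for categorical data, works the same way. A sketch with a made-up string column:

```python
import pandas as pd

# A categorical (string) column with missing entries
city = pd.Series(['Pune', 'Mumbai', 'Pune', None, 'Pune', None])

# mode() returns a Series because there can be ties; take the first value
most_common = city.mode()[0]
print(most_common)  # Pune

city_filled = city.fillna(most_common)
print(city_filled.isnull().sum())  # 0
```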
Conclusion
The approach to dealing with missing values depends heavily on the nature of the data. In this article, we learned about null value treatment in Python. Since you will be dealing with different types of data, use a trial-and-error approach to find the treatment that fits.