Null Value Treatment in Python

Introduction to Null Value Treatment in Python

While coding in Python, it is very common to assign or initialize variables with string, float, or integer values. Sometimes, however, you may want to assign a null value to a variable; dealing with such values is known as null value treatment. Unlike languages such as PHP, Java, or C, Python has no null keyword. Instead, it provides the ‘None’ keyword, which you can use to represent a null value.
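As a minimal sketch of how None behaves in plain Python (the variable names here are only illustrative), the following assigns None and tests for it with an identity comparison:

```python
# None is Python's built-in "no value" object and has its own type.
value = None

print(type(value))   # <class 'NoneType'>

# Identity comparison (is / is not) is the idiomatic test for None.
if value is None:
    status = "missing"
else:
    status = "present"

print(status)        # missing
```

Using `is None` rather than `== None` avoids surprises with objects that define their own equality.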

Real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.

In this article, we discuss some general considerations for missing data (i.e. null values), how Pandas chooses to represent it, and how to detect, drop, and fill it.

Identify NaN and None in Pandas

NaN and None are both used to represent null values, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate. The following example shows this.

Example:-

import numpy as np
import pandas as pd

pd.Series([1, np.nan, 2, None])
Output:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be changed to a floating-point type.

x = pd.Series(range(2), dtype=int)
print(x)
output:
0    0
1    1
dtype: int64

x[0] = None
print(x)
Output:
0    NaN
1    1.0
dtype: float64

Now you can see that in addition to casting the integer array to floating-point, Pandas automatically converts the None to a NaN value. 

The following table lists the upcasting conventions in Pandas when NA values are introduced:

Typeclass    Conversion when storing NAs    NA sentinel value
Floating     No change                      np.nan
Object       No change                      None or np.nan
Integer      Cast to float64                np.nan
Boolean      Cast to object                 None or np.nan

Keep in mind that Pandas has traditionally stored string data with an object dtype.
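The upcasting rules described in this section can be checked directly. The sketch below builds three small Series (the names are illustrative) and inspects the resulting dtypes; string data is left out because its dtype may differ between Pandas versions.

```python
import numpy as np
import pandas as pd

float_s = pd.Series([1.5, np.nan])        # floating: dtype stays float64
int_s = pd.Series([1, 2, None])           # integers are upcast to float64
bool_s = pd.Series([True, False, None])   # booleans are upcast to object

print(float_s.dtype)
print(int_s.dtype)
print(bool_s.dtype)

# In the float and integer cases None has become NaN; in the object case it stays None.
print(int_s.isnull().sum(), bool_s.isnull().sum())
```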



Detecting Null Values

As the examples above show, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. There are several useful methods for detecting, removing, and replacing null values in Pandas data structures.

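They are:

  • isnull(): generates a Boolean mask indicating the null values
  • notnull(): the opposite of isnull()
  • dropna(): returns a filtered version of the data with nulls removed
  • fillna(): returns a copy of the data with the null values filled

Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one returns a Boolean mask over the data.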

For Example:-

data = pd.Series([1, 2, np.nan, 'fireblaze', None])

# isnull(): True where the value is null

data.isnull()
Output:
0    False
1    False
2     True
3    False
4     True
dtype: bool

# notnull(): use the mask to keep only the non-null values

data[data.notnull()]
Output:
0            1
1            2
3    fireblaze
dtype: object

# To count the null values, use .isnull().sum()

data.isnull().sum()
Output:
2
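The same detection methods work on a DataFrame, where they report nulls per column. The sketch below uses a small made-up table (the column names and values are hypothetical) to count the nulls and compute the share of nulls in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Mumbai", None, "Delhi"],
})

null_counts = frame.isnull().sum()    # number of nulls in each column
null_share = frame.isnull().mean()    # fraction of nulls in each column

print(null_counts)
print(null_share)
```

The fraction of nulls per column is a convenient input when deciding whether to drop a column or fill it.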

Null Value Treatment in Python

Data can contain null values for many reasons, such as observations that were never recorded or data corruption. When your data contains null values, analysis of it can be misleading, and many machine learning algorithms do not support missing values. That is why missing values need to be handled.

There are two main ways to handle missing values:

  1. Dropping
  2. Imputation of null values

1. Dropping Missing Values

Suppose a column contains more than 60% to 70% missing values. In that case we prefer to drop the column itself, because dropping the rows that contain those nulls would also discard the values in all the remaining columns, and most of the data would be lost. On the other hand, if a column contains only a limited number of null values, and the related columns are null in the same rows, drop those rows using the Pandas dropna() function.
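The rule of thumb above can be sketched as follows. The 0.6 cut-off, the column names, and the data are assumptions chosen for illustration, not fixed values:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% missing
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% missing
})

threshold = 0.6                                   # assumed cut-off
missing_share = frame.isnull().mean()             # fraction of nulls per column

# Drop the columns that are mostly empty.
cols_to_drop = missing_share[missing_share > threshold].index
reduced = frame.drop(columns=cols_to_drop)

# Then drop the few remaining rows that still contain a null.
cleaned = reduced.dropna()

print(list(reduced.columns))
print(len(cleaned))
```

Dropping the sparse column first keeps four of the five rows; calling dropna() on the original frame would have kept only one.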

Syntax:-

DataFrame.dropna(axis=0 / 1, how='any' / 'all', subset=['column name'], thresh=number)

Where,

axis = 0 -> drop rows that contain null values (the default)

axis = 1 -> drop columns that contain null values

how = 'all' -> drop the row or column only if all of its values are null

how = 'any' -> drop the row or column if it contains any null value (the default)

subset = the labels to check for nulls, e.g. column names when dropping rows

thresh = the minimum number of non-null values a row or column must contain to be kept. Ex. thresh = 2 keeps only the rows or columns that contain at least two non-null values.
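The subset parameter is easiest to see with a small example. In this sketch (made-up column names and values), rows are dropped only when the "score" column is null, regardless of nulls elsewhere:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({
    "name": ["A", "B", "C"],
    "score": [90.0, np.nan, 75.0],
    "remark": [None, "ok", None],
})

# Only nulls in "score" cause a row to be dropped.
by_score = frame.dropna(subset=["score"])

# Without subset, every row here has at least one null and is dropped.
strict = frame.dropna()

print(by_score)
print(len(strict))
```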

Example:-

data = pd.Series([1, 2, np.nan, 'fireblaze', None])

# drop all null values from the Series

data.dropna()

Output:-

0            1
1            2
3    fireblaze
dtype: object

Dropping null values in a DataFrame:-

A DataFrame offers more options. Let’s create one.

Example:-

df = pd.DataFrame([[1,      np.nan, 2],
                   [3,      4,      5],
                   [np.nan, 6,      7]])
print(df)
     0    1  2
0  1.0  NaN  2
1  3.0  4.0  5
2  NaN  6.0  7

We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the problem statement, you might want one or the other, so dropna() gives a number of options for a DataFrame.

By default, dropna() will drop all rows in which any null value is present.

For Example:

df.dropna()


     0    1  2
1  3.0  4.0  5

Another Method:

You can drop NA values along a different axis; axis=1 drops all columns containing a null value:

Example:-

df.dropna(axis = 1)
   2
0  2
1  5
2  7

This also drops some good data, however. You might rather want to drop only the rows or columns in which all values are NA, or in which most values are NA. This can be specified through the ‘how’ or ‘thresh’ parameters, which allow fine control of the number of nulls to allow through.

The default is how='any', such that any row or column containing a null (NaN) value will be dropped. You can also specify how='all', which will only drop rows/columns whose values are all null.

Now, add a column that contains only NaN values to the DataFrame.

Example:-

df[3] = np.nan
print(df)

     0    1  2   3
0  1.0  NaN  2 NaN
1  3.0  4.0  5 NaN
2  NaN  6.0  7 NaN

df.dropna(axis='columns', how='all')  # drop the columns where all values are NaN

     0    1  2
0  1.0  NaN  2
1  3.0  4.0  5
2  NaN  6.0  7

The ‘thresh’ parameter lets you specify a minimum number of non-null values that a row/column must contain to be kept.

Example:-

df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  3.0  4.0  5 NaN

Here the first and last rows have been dropped because thresh=3 requires at least three non-null values in a row, and they contain only two non-null values.
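To see why thresh=3 keeps only the middle row, it helps to count the non-null values per row. This sketch rebuilds the four-column DataFrame so that it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2, np.nan],
                   [3,      4,      5, np.nan],
                   [np.nan, 6,      7, np.nan]])

# Number of non-null values in each row: 2, 3, 2
non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row)

# Rows with fewer than 3 non-null values are dropped.
kept = df.dropna(axis=0, thresh=3)
print(kept)
```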



Filling null values

Sometimes, rather than dropping NA values, you would rather replace them with a valid value. Dropping is not the right choice for every problem statement, because the other columns in a dropped row (or the other rows in a dropped column) may still hold useful information. A better option is then to fill the null values. The fill value might be a single number such as zero, or the mean, median, or mode of the data. You could do this in place using the isnull() method as a mask, but because it is such a common operation, Pandas provides the fillna() method, which returns a copy of the data with the null values replaced.
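The two approaches mentioned above can be compared side by side. This is a small sketch with made-up values: one Series is filled in place through an isnull() mask, the other through fillna().

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Approach 1: assign through a Boolean mask (modifies the copy in place).
masked = s.copy()
masked[masked.isnull()] = 0

# Approach 2: fillna() returns a new Series and leaves s unchanged.
filled = s.fillna(0)

print(masked.equals(filled))   # True
print(s.isnull().sum())        # 2, the original still has its nulls
```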

Generally, we fill null values in numerical data using the mean or median, and in categorical data using the mode.

Which measure of central tendency should you use to fill null values?

  • Mean – used when your data is not skewed (i.e. roughly normally distributed)
  • Median – used when your data is skewed (i.e. not normally distributed), because it is less affected by extreme values
  • Mode – can also be used when your data is skewed; mostly used for filling null values in categorical data
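The guidelines above might be applied as in the following sketch. The column names and values are invented for illustration: "height" is roughly symmetric, "income" has one extreme value, and "colour" is categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
frame = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0, 160.0],
    "income": [20.0, 25.0, np.nan, 30.0, 1000.0],
    "colour": ["red", "blue", "red", None, "red"],
})

# Symmetric numeric data: fill with the mean.
frame["height"] = frame["height"].fillna(frame["height"].mean())

# Skewed numeric data: the median is not pulled up by the 1000.0 outlier.
frame["income"] = frame["income"].fillna(frame["income"].median())

# Categorical data: mode() returns a Series, so take its first entry.
frame["colour"] = frame["colour"].fillna(frame["colour"].mode()[0])

print(frame)
```

Here the mean of "income" would have been 268.75, which is far from any typical value, while the median gives 27.5.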

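Syntax:-

fillna(value, method='ffill' / 'bfill', axis=0 / 1)

value = the value used to replace the nulls

method = 'ffill' -> fill null values in the forward direction (carry the previous valid value forward)

method = 'bfill' -> fill null values in the backward direction (carry the next valid value backward)

axis = 0 -> fill along each column (down the rows)

axis = 1 -> fill along each row (across the columns)

Note: recent versions of Pandas deprecate the method argument of fillna() in favour of the dedicated ffill() and bfill() methods, which accept the same axis argument.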

For Example:-

df = pd.Series([1, np.nan, 2, None, 3, None, 4], index=list('abcdefg'))
print(df)

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
f    NaN
g    4.0
dtype: float64

# Fill NA values with zero.
df.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
f    0.0
g    4.0
dtype: float64

# Fill the values using forward fill. In other words, each NA value is filled with the previous valid value.
Example:-

# forward-fill (same result as fillna(method='ffill') in older Pandas)
df.ffill()
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
f    3.0
g    4.0
dtype: float64

Another method is back-fill, which propagates the next valid values backward.
Example:-

# back-fill (same result as fillna(method='bfill') in older Pandas)
df.bfill()
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
f    4.0
g    4.0
dtype: float64

Similarly, ffill and bfill can be applied to a DataFrame.
So, let's create a DataFrame.
Example:-

df = pd.DataFrame([[1,      np.nan, 2, np.nan],
                   [3,      4,      5, np.nan],
                   [np.nan, 6,      7, np.nan]])
print(df)

     0    1  2   3
0  1.0  NaN  2 NaN
1  3.0  4.0  5 NaN
2  NaN  6.0  7 NaN

# forward-fill along each row (same result as fillna(method='ffill', axis=1) in older Pandas)
df.ffill(axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  3.0  4.0  5.0  5.0
2  NaN  6.0  7.0  7.0

Note: if a previous value is not available during a forward fill, the NA value remains.
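The note above can be demonstrated on a Series that starts with a null. One possible workaround, shown here as a sketch rather than a recommendation, is to follow the forward fill with a backward fill so that leading nulls take the first valid value:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0])

# There is no earlier value to carry forward, so the first entry stays NaN.
forward = s.ffill()
print(forward)

# A backward fill afterwards covers the leading null.
both = s.ffill().bfill()
print(both)
```

Whether filling a leading null from a later value is acceptable depends on the data; for time series it uses information from the future.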

Fill null value using the mean of a particular column


# Create a data frame
df = pd.DataFrame([[ 0,  1,      np.nan,  3,      4],
                   [ 5,  np.nan,  7,      8,      9],
                   [10, 11,      12,     13,      np.nan],
                   [15,  np.nan, 17,     18,     19],
                   [20, 21,      22,      np.nan, np.nan]])
df
Output:-

    0     1     2     3     4
0   0   1.0   NaN   3.0   4.0
1   5   NaN   7.0   8.0   9.0
2  10  11.0  12.0  13.0   NaN
3  15   NaN  17.0  18.0  19.0
4  20  21.0  22.0   NaN   NaN



Example:-
# Compute the mean of the column labelled 4
mean_value = df[4].mean()

mean_value
Output:-
10.666666666666666

Pass the mean_value variable to the fillna() function to fill the null values with the mean of that particular column.

df[4] = df[4].fillna(mean_value)   # assign the result back; more reliable than inplace=True on a selected column

df
Output:-

    0     1     2     3          4
0   0   1.0   NaN   3.0   4.000000
1   5   NaN   7.0   8.0   9.000000
2  10  11.0  12.0  13.0  10.666667
3  15   NaN  17.0  18.0  19.000000
4  20  21.0  22.0   NaN  10.666667

# Alternative to the above code: compute the mean inline
# df[4] = df[4].fillna(df[4].mean())

Next, we fill the null values in column 1 using the median of that column.

Example:-
df[1] = df[1].fillna(df[1].median())

df

Output:-

    0     1     2     3          4
0   0   1.0   NaN   3.0   4.000000
1   5  11.0   7.0   8.0   9.000000
2  10  11.0  12.0  13.0  10.666667
3  15  11.0  17.0  18.0  19.000000
4  20  21.0  22.0   NaN  10.666667
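Filling one column at a time can become repetitive. As a sketch of one alternative, fillna() also accepts a Series of per-column fill values, so passing df.mean() fills every numeric column with its own mean in one call. The DataFrame is rebuilt here so the example runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[ 0,  1,      np.nan,  3,      4],
                   [ 5,  np.nan,  7,      8,      9],
                   [10, 11,      12,     13,      np.nan],
                   [15,  np.nan, 17,     18,     19],
                   [20, 21,      22,      np.nan, np.nan]])

# df.mean() is indexed by column label, so each column gets its own mean.
filled = df.fillna(df.mean())

print(filled)
```

Replacing df.mean() with df.median() fills each column with its median instead.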


Conclusion

The right approach to dealing with missing values depends heavily on the nature of the data. In this article, we learned about null value treatment in Python: detecting null values, dropping them, and filling them. Since you will deal with different types of data, expect to try several of these methods and compare the results.
