Introduction to Dealing with Categorical Data
In this article we discuss dealing with categorical data in Python. Categorical features describe an observation with non-numerical values, so they need to be converted into a form the computer can process. In particular, many machine learning algorithms require that their input is numerical, and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.
Identifying Categorical Data
- Nominal
- Ordinal
- Continuous
Categorical features can only take on a limited, usually fixed, number of possible values.
For example, if a dataset contains information about users, you will typically find features like country, sex, fruit_name, etc. These are all categorical features, and their values are text. For example, sex is described as Male (M) or Female (F), and product type could be described as electronics, apparel, food, etc.
Features like these, whose categories are only labels without any order of precedence, are called nominal features.
Features that have some order associated with them are called ordinal features. For example, a feature like credit score, with three categories: low, medium, and high, which have an order associated with them.
There are also continuous features. These are numeric variables that have an infinite number of possible values between any two values. A continuous variable can be numeric or a date/time.
Machine learning models such as regression or SVM (support vector machine) are algebraic, which means their input must be numerical. Categorical features must therefore be transformed into numeric form before you can apply the learning algorithm to them.
For the machine, categorical data doesn’t contain the same context or information that humans can easily understand. For example, when looking at a feature called City with three cities New York, New Jersey, and New Delhi, humans can easily differentiate that New York is closely related to New Jersey as they are from the same country, while New York and New Delhi are much different. But for the model, New York, New Jersey, and New Delhi, are just three different levels (possible values) of the same feature City. If you don’t specify the additional contextual information, it will be impossible for the model to differentiate between highly different levels.
One of the most common ways to make this numeric transformation is to one-hot encode the categorical features, especially when there is no natural ordering between the categories (e.g. a feature 'City' with names of cities such as 'Mumbai', 'Nagpur', 'Delhi', etc.). For each unique value of a feature (say, 'Mumbai') one column is created (say, 'City Mumbai') whose value is 1 if the original feature takes that value for that instance and 0 otherwise.
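As a quick illustration, here is a minimal sketch of what one-hot encoding does to a small, made-up 'City' column (this toy DataFrame is purely illustrative and is not the dataset used in the rest of the article):
import pandas as pd
# A tiny DataFrame with a single nominal feature
cities = pd.DataFrame({"City": ["Mumbai", "Nagpur", "Delhi", "Mumbai"]})
# One column per unique city; each row gets 1 (or True, in newer pandas)
# in the column matching its original value and 0 (or False) elsewhere
pd.get_dummies(cities, columns=["City"])
The result has the columns City_Delhi, City_Mumbai, and City_Nagpur, and each row is marked in exactly one of them. With that picture in mind, let's load a real dataset and walk through the different options.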
import pandas as pd
import numpy as np
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
header=None, names=headers, na_values="?" )
df.head()
df.dtypes
Output:
symboling int64
normalized_losses float64
make object
fuel_type object
aspiration object
num_doors object
body_style object
drive_wheels object
engine_location object
wheel_base float64
length float64
width float64
height float64
curb_weight int64
engine_type object
num_cylinders object
engine_size int64
fuel_system object
bore float64
stroke float64
compression_ratio float64
horsepower float64
peak_rpm float64
city_mpg int64
highway_mpg int64
price float64
dtype: object
Since this article focuses only on encoding the categorical variables, we can use pandas' helpful select_dtypes function to pull out just the object columns.
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()
Output:
  | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | engine_type | num_cylinders | fuel_system
0 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi
1 | alfa-romero | gas | std | two | convertible | rwd | front | dohc | four | mpfi
2 | alfa-romero | gas | std | two | hatchback | rwd | front | ohcv | six | mpfi
3 | audi | gas | std | four | sedan | fwd | front | ohc | four | mpfi
4 | audi | gas | std | four | sedan | 4wd | front | ohc | five | mpfi
There are four common ways of dealing with categorical data in Python.
Method 1 – Find and Replace
There are two columns of data where text values are really used to represent numbers: the number of cylinders in the engine and the number of doors on the car. Pandas makes it easy to replace the text values with their numeric equivalents using replace.
The num_cylinders column contains only seven distinct values, and num_doors is only ever two or four doors.
obj_df["num_cylinders"].value_counts()
four 159
six 24
five 11
eight 5
two 4
twelve 1
three 1
Name: num_cylinders, dtype: int64
cleanup_nums = {"num_doors": {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three": 3}}
To convert the columns to numbers using replace:
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()
Output: the first five rows of obj_df, with num_doors and num_cylinders now showing their numeric values.
The benefit of this approach is that pandas “knows” the types of values in the columns, so the converted columns are now int64 instead of object.
obj_df.dtypes
make object
fuel_type object
aspiration object
num_doors int64
body_style object
drive_wheels object
engine_location object
engine_type object
num_cylinders int64
fuel_system object
dtype: object
Method 2 – Label Encoding
Another way of encoding categorical values is to use a technique called label encoding. Label encoding simply converts each value in a column to a number.
One trick you can use in pandas is to convert a column to a category, then use those category values for your label encoding:
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.dtypes
Output:
make object
fuel_type object
aspiration object
num_doors int64
body_style category
drive_wheels object
engine_location object
engine_type object
num_cylinders int64
fuel_system object
dtype: object
Then you can assign the encoded values to a new column using the cat.codes accessor:
obj_df[“body_style_cat”] = obj_df[“body_style”].cat.codes
obj_df.head()
Output: the first five rows, with a new body_style_cat column containing the numeric category codes.
The advantage of this approach is that you get the benefits of pandas categories (compact data size, ability to order, plotting support) while the values can still easily be converted to numbers for further analysis.
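The "ability to order" is particularly useful for ordinal features such as the credit score example mentioned earlier. A minimal sketch, using a hypothetical credit_score column rather than the autos dataset:
import pandas as pd
# Hypothetical ordinal feature with a natural low < medium < high ordering
scores = pd.DataFrame({"credit_score": ["low", "high", "medium", "low"]})
# Declare the order explicitly so sorting and comparisons respect it
scores["credit_score"] = pd.Categorical(scores["credit_score"],
                                        categories=["low", "medium", "high"],
                                        ordered=True)
# cat.codes now follows the declared order: low=0, medium=1, high=2
scores["credit_score_cat"] = scores["credit_score"].cat.codes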
Method 3 – One Hot Encoding
Label encoding has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the value of 0 is obviously less than the value of 4, but does that really correspond to the data set in real life? Does a wagon have “4X” more weight in our calculation than the convertible?
A common alternative approach is called one-hot encoding (the same idea also goes by the name dummy encoding). The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.
Pandas supports this with the get_dummies function.
Hopefully, a simple example will make this clearer. We can look at the column drive_wheels, which has the values 4wd, fwd, and rwd. With the help of get_dummies, we can convert this into three columns containing a 1 or 0.
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
The new data set contains three new columns:
- drive_wheels_4wd
- drive_wheels_fwd
- drive_wheels_rwd
One-hot encoding is very useful, but it can cause the number of columns to expand greatly if a column has many unique values.
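If that is a concern, it can help to check a column's cardinality with nunique() before encoding it. get_dummies also accepts prefix and drop_first arguments; the sketch below uses illustrative prefix names and assumes the obj_df built up earlier in this article:
# Check how many unique values each remaining text column has before encoding
obj_df.select_dtypes(include=['object']).nunique()
# prefix shortens the generated column names; drop_first=True drops one dummy
# per feature, which is still enough information for many models
pd.get_dummies(obj_df, columns=["drive_wheels", "body_style"],
               prefix=["drive", "body"], drop_first=True).head()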
Method 4 – Custom Binary Encoding
Depending on the data set, you may sometimes be able to use a combination of label encoding and one-hot encoding to create a binary column that meets your needs for further analysis.
obj_df["engine_type"].value_counts()
ohc 148
ohcf 15
ohcv 13
l 12
dohc 12
rotor 4
dohcv 1
Name: engine_type, dtype: int64
In other words, the various versions of OHC are all the same for this analysis. If that is the case, then we could use the str accessor plus np.where to create a new column that indicates whether or not the car has an OHC engine.
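A minimal sketch of what that could look like (the new column name OHC_Code is just an illustrative choice):
import numpy as np
# Flag every engine_type that contains "ohc" (ohc, ohcf, ohcv, dohc, dohcv)
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
obj_df[["engine_type", "OHC_Code"]].head()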
Conclusion
In this article, we discussed dealing with categorical data. Encoding categorical variables is an important step in the data science and data analysis process. Because there are a number of methods for encoding variables, it is important to understand the various options and how to implement them on your own data sets. Become a data scientist with the Data Science & Machine Learning Program from Fireblaze AI School.