Introduction to Random Forest Algorithm
Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The idea behind bagging is that a combination of learning models improves the overall result.
Generally, the more trees in the forest, the more robust it is. Similarly, in the random forest classifier, a higher number of decision trees tends to give higher and more stable accuracy, up to a point.
The structure of a random forest looks like the figure below.
As the figure above shows, we split the data into training and testing parts and fit the random forest on the training data. The random forest builds multiple decision trees from the training data, and the prediction on the testing data is decided by the majority of the trees' outputs. This is why it improves the accuracy of the model.
Let's take an example.
The problem statement: you are looking to buy a house but are unable to decide which one, so you visit an agent, who gives you a list of parameters to consider before buying. The parameters are: house price, number of bedrooms, parking space, locality, and available facilities.
Let’s form a single decision tree
As said earlier, a random forest is formed from multiple decision trees; it randomly selects a subset of the parameters for each decision tree.
It looks like the figure below.
Here we have created multiple decision trees (three), each using a different random subset of the given parameters. Each decision tree predicts an output from the data, and the random forest aggregates those values to predict the final outcome.
In simple words: after building the random forest model, to get a prediction we take the majority vote of the trees in the case of classification, and the average of their outputs in the case of regression.
This is also a key difference between a single decision tree and a random forest.
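To make the aggregation concrete, here is a minimal sketch with made-up per-tree outputs (the numbers are hypothetical, not from any dataset):

import numpy as np

# Hypothetical outputs from 5 trees for a single sample
tree_votes = np.array([1, 0, 1, 1, 0])                        # classification: class labels
tree_values = np.array([210.0, 195.5, 202.3, 220.1, 207.8])   # regression: numeric outputs

majority_class = np.bincount(tree_votes).argmax()  # majority vote -> 1
average_value = tree_values.mean()                 # aggregate (mean) -> 207.14
print(majority_class, average_value)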
Why Use Random Forest?
When we build a decision tree model, there is only one tree, grown from the training data. The tree fits all the training parameters, but when new data comes in, the accuracy can drop sharply.
This happens due to overfitting.
Overfitting: overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train the model too closely on a noisy dataset, so accuracy on the training data is high but accuracy in the testing phase is very low. Such models have low bias and high variance.
A random forest reduces this problem because each tree is trained on a different random sample of the data, so the errors of individual overfit trees tend to average out.
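As a rough illustration, here is a sketch comparing the two models on a synthetic scikit-learn dataset. The exact scores will vary, but a fully grown single tree typically scores perfectly on its training data while generalizing worse than the forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to compare the two models
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# print train and test accuracy for each model
print('tree  :', tree.score(X_train, y_train), tree.score(X_test, y_test))
print('forest:', forest.score(X_train, y_train), forest.score(X_test, y_test))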
The Mathematics Behind Random Forest
Gini Index
One method to decide how to split the data is the Gini index. It measures the impurity (or purity) of the data and is used in the CART (Classification and Regression Tree) algorithm that underlies decision trees. The CART algorithm uses the Gini index to create binary splits.
An attribute with a low Gini index is preferred as the root node.
The formula to calculate the Gini index is:

Gini Index = 1 − ∑ⱼ pⱼ²

where pⱼ is the proportion of samples belonging to class j at the node.
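As a quick illustration, a minimal Python sketch of this formula, with pⱼ estimated from class counts:

import numpy as np

def gini_index(labels):
    # p_j = proportion of samples belonging to class j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini = 1 - sum_j p_j^2
    return 1 - np.sum(p ** 2)

print(gini_index(np.array([0, 0, 1, 1])))  # 0.5 -> maximally impure for two classes
print(gini_index(np.array([0, 0, 0, 0])))  # 0.0 -> pure node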
Information Gain
Information gain is calculated from the entropy of the dataset and the entropy after splitting on an attribute. It tells us how much information a feature provides about the class.
Entropy measures the impurity or randomness present in the given data. It is used to decide which feature becomes the root node of the decision tree, i.e. where to split the data.
The formulas to calculate entropy and information gain are:

Entropy(S) = − ∑ⱼ pⱼ log₂ pⱼ

Information Gain = Entropy(S) − ∑ᵥ (|Sᵥ| / |S|) × Entropy(Sᵥ)

where S is the set of samples at the node and the Sᵥ are the subsets produced by the split; the second term is the weighted average of the subset entropies.
We select the feature with the highest information gain as the root node.
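A minimal Python sketch of both quantities (the split here is a hypothetical two-way partition of the labels):

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_j p_j * log2(p_j)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    # Parent entropy minus the weighted average of the child entropies
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = np.array([0, 0, 1, 1])
print(information_gain(parent, [np.array([0, 0]), np.array([1, 1])]))  # 1.0 -> a perfect split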
Regression Problems
When the random forest algorithm is used to solve regression problems, the mean squared error (MSE) is used to decide how the data branches at each node:

MSE = (1/N) ∑ᵢ (yᵢ − ŷᵢ)²

where N is the number of data points, yᵢ is the actual value, and ŷᵢ is the value predicted by the node (the mean of its samples).
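For example, a minimal sketch on synthetic data (not the loan dataset used below) with scikit-learn's RandomForestRegressor, whose default split criterion minimizes squared error:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each prediction is the average of the individual trees' outputs
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))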
Pros of Random Forest
- It can be used for both classification and regression tasks.
- It handles large datasets with high dimensionality.
- It improves the accuracy of the model and helps prevent overfitting.
Cons of Random Forest
- Prediction can be slow once the model is built, because every tree in the forest must be evaluated.
- It is less suitable for regression tasks, since it cannot predict values outside the range seen in the training data.
Implementation of Random Forest
We will use LendingClub loan data (loan_data.csv). Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be riskier are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: The target variable we predict below: 1 if the loan was not paid back in full, and 0 otherwise.
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Get the Data
loans = pd.read_csv('loan_data.csv')
# Check out the info(), head(), and describe() methods on loans.
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy 9578 non-null int64
purpose 9578 non-null object
int.rate 9578 non-null float64
installment 9578 non-null float64
log.annual.inc 9578 non-null float64
dti 9578 non-null float64
fico 9578 non-null int64
days.with.cr.line 9578 non-null float64
revol.bal 9578 non-null int64
revol.util 9578 non-null float64
inq.last.6mths 9578 non-null int64
delinq.2yrs 9578 non-null int64
pub.rec 9578 non-null int64
not.fully.paid 9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
loans.describe()
Output:-
|  | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9.578000e+03 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 | 9578.000000 |
| mean | 0.804970 | 0.122640 | 319.089413 | 10.932117 | 12.606679 | 710.846314 | 4560.767197 | 1.691396e+04 | 46.799236 | 1.577469 | 0.163708 | 0.062122 | 0.160054 |
| std | 0.396245 | 0.026847 | 207.071301 | 0.614813 | 6.883970 | 37.970537 | 2496.930377 | 3.375619e+04 | 29.014417 | 2.200245 | 0.546215 | 0.262126 | 0.366676 |
| min | 0.000000 | 0.060000 | 15.670000 | 7.547502 | 0.000000 | 612.000000 | 178.958333 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.103900 | 163.770000 | 10.558414 | 7.212500 | 682.000000 | 2820.000000 | 3.187000e+03 | 22.600000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 0.122100 | 268.950000 | 10.928884 | 12.665000 | 707.000000 | 4139.958333 | 8.596000e+03 | 46.300000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 0.140700 | 432.762500 | 11.291293 | 17.950000 | 737.000000 | 5730.000000 | 1.824950e+04 | 70.900000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 0.216400 | 940.140000 | 14.528354 | 29.960000 | 827.000000 | 17639.958330 | 1.207359e+06 | 119.000000 | 33.000000 | 13.000000 | 5.000000 | 1.000000 |
loans.head()
|  | credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
| 1 | 1 | credit_card | 0.1071 | 228.22 | 11.082143 | 14.29 | 707 | 2760.000000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
| 2 | 1 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
| 3 | 1 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
| 4 | 1 | credit_card | 0.1426 | 102.92 | 11.299732 | 14.97 | 667 | 4066.000000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
# Exploratory Data Analysis
## Categorical Features
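# 'purpose' is the only non-numeric column; one-hot encode it with
# pd.get_dummies, dropping the first level to avoid redundant columns.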
cat_feats = ['purpose']
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)
final_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
credit.policy 9578 non-null int64
int.rate 9578 non-null float64
installment 9578 non-null float64
log.annual.inc 9578 non-null float64
dti 9578 non-null float64
fico 9578 non-null int64
days.with.cr.line 9578 non-null float64
revol.bal 9578 non-null int64
revol.util 9578 non-null float64
inq.last.6mths 9578 non-null int64
delinq.2yrs 9578 non-null int64
pub.rec 9578 non-null int64
not.fully.paid 9578 non-null int64
purpose_credit_card 9578 non-null uint8
purpose_debt_consolidation 9578 non-null uint8
purpose_educational 9578 non-null uint8
purpose_home_improvement 9578 non-null uint8
purpose_major_purchase 9578 non-null uint8
purpose_small_business 9578 non-null uint8
dtypes: float64(6), int64(7), uint8(6)
memory usage: 1.0 MB
# Train Test Split
from sklearn.model_selection import train_test_split
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
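# Hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility.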
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
## Training the Random Forest model
from sklearn.ensemble import RandomForestClassifier
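# n_estimators sets the number of trees; more trees give more stable
# predictions at the cost of training and prediction time.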
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train,y_train)
# Prediction and Evaluation
predictions = rfc.predict(X_test)
# Create Classification Report
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
# Output
precision recall f1-score support
0 0.85 1.00 0.92 2431
1 0.58 0.02 0.05 443
accuracy 0.85 2874
macro avg 0.71 0.51 0.48 2874
weighted avg 0.81 0.85 0.78 2874
print(confusion_matrix(y_test,predictions))
[[2423 8]
[ 432 11]]
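Note that although the overall accuracy is 0.85, the confusion matrix shows the model catches only 11 of the 443 loans that were not fully paid (recall 0.02 for class 1). The classes are heavily imbalanced (2431 vs. 443 in the test set), so approaches such as class weighting or resampling would be worth exploring.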