Random Forest Algorithm in Machine Learning


Introduction to Random Forest Algorithm

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of multiple decision trees, usually trained with the “bagging” method. The idea behind bagging is that combining several learning models improves the overall result.
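The resampling step behind bagging can be sketched in a few lines. This is a minimal illustration, not the internals of any particular library: the toy arrays, the seed, and the helper name bootstrap_sample are assumptions made for the example.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement: the resampling step of bagging."""
    n_rows = X.shape[0]
    idx = rng.integers(0, n_rows, size=n_rows)  # indices may repeat
    return X[idx], y[idx]

rng = np.random.default_rng(seed=0)  # seed chosen only for repeatability
X = np.arange(20).reshape(10, 2)     # toy data: 10 rows, 2 features
y = np.arange(10) % 2                # toy binary labels

# One bootstrap sample per tree; each tree would be trained on its own sample.
samples = [bootstrap_sample(X, y, rng) for _ in range(3)]
for i, (X_b, y_b) in enumerate(samples):
    distinct_rows = len(np.unique(X_b, axis=0))
    print(f"tree {i}: {len(X_b)} rows drawn, {distinct_rows} distinct")
```

Because rows are drawn with replacement, each sample usually contains some rows more than once and misses others, which is what makes the individual trees differ from one another.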

Generally, the more trees in a forest, the more robust it is. Similarly, in the random forest classifier, a higher number of decision trees generally gives more stable and accurate predictions, although the gains level off beyond a certain number of trees while the training cost keeps growing.

The structure of a random forest looks like the figure below.

[Figure: Random Forest Algorithm]

As the figure above shows, the data is split into training and testing parts, and the random forest is applied to the training data. The random forest creates multiple decision trees on the basis of the training data, and each prediction on the testing data is decided by the majority of the trees' outputs. For this reason it improves the accuracy of the model.

Let’s take an example

Suppose you are looking to buy a house but are unable to decide which one to buy. You visit an agent, who gives you a list of parameters to consider before buying a house. The parameters are: house price, number of bedrooms, parking space, locality, and available facilities.

A single decision tree would weigh all of these parameters together to reach one decision.

As said earlier, a random forest is formed from multiple decision trees, and it randomly selects a set of parameters for each decision tree.

Here we have created multiple decision trees, i.e. 3, each using 3 of the given parameters in the dataset. Each decision tree predicts some output from its data, and we aggregate those values to predict the outcome of the random forest.
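As a rough sketch of the random parameter selection described above, the snippet below picks 3 of the house-buying parameters for each of 3 trees. The seed and variable names are assumptions for illustration only.

```python
import numpy as np

# The five house-buying parameters from the example above.
parameters = ["House Price", "Number of bedrooms", "Parking space",
              "Locality", "Available facilities"]

rng = np.random.default_rng(seed=42)  # seed chosen only for repeatability
n_trees = 3            # illustrative, matching the three trees described
n_params_per_tree = 3  # illustrative subset size

# Each tree gets its own randomly chosen subset of parameters
# (no parameter is repeated within one tree).
tree_parameters = [
    list(rng.choice(parameters, size=n_params_per_tree, replace=False))
    for _ in range(n_trees)
]
for i, subset in enumerate(tree_parameters, start=1):
    print(f"Tree {i} uses: {subset}")
```

This follows the simplified per-tree description used in this article. Note that scikit-learn's RandomForestClassifier instead draws a random subset of features at every split, controlled by its max_features parameter.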

In simple words, after the random forest model is built, its prediction is the majority vote of the trees in the case of classification and the average of the trees' outputs in the case of regression.

This is also a key difference between a decision tree and a random forest.
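The two ways of combining tree outputs, majority vote for classification and averaging for regression, can be sketched as follows. The tree predictions here are hypothetical values invented for the example, and the helper names are not from any library.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Classification: return the most common class label for each sample."""
    preds = np.asarray(tree_predictions)  # shape (n_trees, n_samples)
    voted = []
    for column in preds.T:                # one column = all trees' votes for a sample
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

def average_prediction(tree_predictions):
    """Regression: return the mean of the trees' outputs for each sample."""
    return np.asarray(tree_predictions, dtype=float).mean(axis=0)

# Hypothetical outputs of 3 trees for 4 samples (classification).
class_preds = [[1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 0, 1, 1]]
print(majority_vote(class_preds))       # [1 0 1 1]

# Hypothetical outputs of 3 trees for 2 samples (regression).
value_preds = [[10.0, 20.0], [12.0, 22.0], [14.0, 24.0]]
print(average_prediction(value_preds))  # [12. 22.]
```

Using an odd number of trees, as here, avoids ties in a two-class vote.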

Why Use Random Forest?

When we build a decision tree model, there is only one tree, formed on the basis of the training data. The tree fits all of the training parameters, but when new data comes in, its accuracy can drop sharply.

This happens because of overfitting.

Overfitting: overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model too closely on noisy datasets, so the accuracy on the training data is high but the accuracy in the testing phase is much lower. These models have low bias and high variance.

This problem can be reduced by using a random forest, because it builds many different trees on different samples of the dataset and combines their outputs.
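One way to see the overfitting gap is to compare a single decision tree with a random forest on the same noisy data. The sketch below uses a synthetic dataset from scikit-learn's make_classification, not the loan data used later in this article, and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: flip_y randomly reassigns about 10% of the labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A fully grown single tree versus a forest of 100 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
forest_train, forest_test = forest.score(X_train, y_train), forest.score(X_test, y_test)

print(f"Decision tree: train={tree_train:.2f} test={tree_test:.2f}")
print(f"Random forest: train={forest_train:.2f} test={forest_test:.2f}")
```

A large difference between the training and testing accuracy of the single tree is the sign of overfitting described above. The exact numbers depend on the data and the library version, so none are quoted here.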

The Mathematics Behind Random Forest

Gini Index

One method for splitting the data is the Gini index. It measures the impurity of the data and is used in the CART (Classification and Regression Tree) algorithm for building decision trees.

The CART algorithm uses the Gini index to create binary splits.

An attribute with a low Gini index is preferred as the root node.

The formula to calculate the Gini index is:

$$\text{Gini Index} = 1 - \sum_j P_j^2$$
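A small sketch of this formula in code, assuming P_j is the proportion of samples at a node that belong to class j. The function name and the example labels are made up for illustration.

```python
from collections import Counter

def gini_index(labels):
    """Gini index = 1 - sum_j P_j^2, where P_j is the share of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes", "yes", "yes", "yes"]))  # 0.0 (pure node)
print(gini_index(["yes", "yes", "no", "no"]))    # 0.5 (evenly mixed, two classes)
```

A value of 0 means the node contains a single class, and larger values mean the classes are more mixed.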

Information Gain

Information gain is calculated from the entropy of the data set and the entropy of each attribute. It tells us how much information a feature provides about the class.

The formula to calculate entropy is:

$$\text{Entropy}(S) = -\sum_i p_i \log_2 p_i$$

Entropy measures the impurity or randomness present in the given data. It is used to decide the root node of the decision tree when splitting the data.

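The formula to calculate information gain is:

$$\text{Information Gain} = \text{Entropy}(S) - \sum_v \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

Here $S_v$ is the subset of $S$ in which the feature takes the value $v$, so the sum is the weighted average entropy after splitting on that feature.

The feature with the highest information gain is selected as the root node.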
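Entropy and information gain can be sketched in plain Python as below. The tiny house-buying dataset and the function names are assumptions invented for the example: one feature separates the classes perfectly and the other tells us nothing.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: -sum_i p_i * log2(p_i) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the whole set minus the weighted average entropy after the split."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

labels = ["buy", "buy", "skip", "skip"]  # hypothetical decisions
parking = ["yes", "yes", "no", "no"]     # separates the classes perfectly
bedrooms = [2, 3, 2, 3]                  # carries no information about the class

print(information_gain(labels, parking))   # 1.0
print(information_gain(labels, bedrooms))  # 0.0
```

In this toy case the parking feature would be chosen as the root node, since it has the highest information gain.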

Regression Problems

When the random forest algorithm is used to solve regression problems, the mean squared error (MSE) is used to decide how the data branches at each node.
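The idea of scoring a regression split by MSE can be sketched as follows: a candidate split is good if the weighted MSE of the two child nodes is much lower than the MSE of the parent node. The prices, the threshold, and the helper names are hypothetical.

```python
import numpy as np

def mse(values):
    """Mean squared error of values around their own mean (0.0 for an empty node)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2)) if values.size else 0.0

def split_mse(feature, target, threshold):
    """Weighted MSE of the two child nodes produced by feature <= threshold."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    left, right = target[feature <= threshold], target[feature > threshold]
    n = target.size
    return (left.size / n) * mse(left) + (right.size / n) * mse(right)

bedrooms = [1, 2, 3, 4]               # hypothetical feature values
price = [100.0, 110.0, 200.0, 210.0]  # hypothetical target values

print(mse(price))                     # 2525.0 (before splitting)
print(split_mse(bedrooms, price, 2))  # 25.0 (after splitting at bedrooms <= 2)
```

A tree-building procedure would try many thresholds and keep the one with the lowest weighted MSE.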

Pros of Random Forest

  • It can be used for both classification and regression tasks.
  • It handles large datasets with high dimensionality.
  • It improves the accuracy of the model and reduces overfitting.

Cons of Random Forest

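  • Prediction can be slow once the model is created, because every tree has to be evaluated.
  • It is less suitable for regression tasks than for classification, because its predictions cannot extrapolate beyond the range of target values seen in training.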

Implementation of Random Forest

The example below uses loan data from LendingClub.com (loan_data.csv). Here is what the columns represent:

* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

* purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “home_improvement”, “major_purchase”, “small_business”, and “all_other”).

* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be riskier are assigned higher interest rates.

* installment: The monthly installments owed by the borrower if the loan is funded.

* log.annual.inc: The natural log of the self-reported annual income of the borrower.

* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

* fico: The FICO credit score of the borrower.

* days.with.cr.line: The number of days the borrower has had a credit line.

* revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).

* revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).

* inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.

* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

* pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic; omit this line when running as a plain Python script
%matplotlib inline

# Get the data
loans = pd.read_csv('loan_data.csv')

# Check out the info(), describe(), and head() methods on loans
loans.info()

Output:-

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
 

loans.describe()

Output:-

       credit.policy     int.rate  installment  log.annual.inc          dti
count    9578.000000  9578.000000  9578.000000     9578.000000  9578.000000
mean        0.804970     0.122640   319.089413       10.932117    12.606679
std         0.396245     0.026847   207.071301        0.614813     6.883970
min         0.000000     0.060000    15.670000        7.547502     0.000000
25%         1.000000     0.103900   163.770000       10.558414     7.212500
50%         1.000000     0.122100   268.950000       10.928884    12.665000
75%         1.000000     0.140700   432.762500       11.291293    17.950000
max         1.000000     0.216400   940.140000       14.528354    29.960000

              fico  days.with.cr.line     revol.bal   revol.util
count  9578.000000        9578.000000  9.578000e+03  9578.000000
mean    710.846314        4560.767197  1.691396e+04    46.799236
std      37.970537        2496.930377  3.375619e+04    29.014417
min     612.000000         178.958333  0.000000e+00     0.000000
25%     682.000000        2820.000000  3.187000e+03    22.600000
50%     707.000000        4139.958333  8.596000e+03    46.300000
75%     737.000000        5730.000000  1.824950e+04    70.900000
max     827.000000       17639.958330  1.207359e+06   119.000000

       inq.last.6mths  delinq.2yrs      pub.rec  not.fully.paid
count     9578.000000  9578.000000  9578.000000     9578.000000
mean         1.577469     0.163708     0.062122        0.160054
std          2.200245     0.546215     0.262126        0.366676
min          0.000000     0.000000     0.000000        0.000000
25%          0.000000     0.000000     0.000000        0.000000
50%          1.000000     0.000000     0.000000        0.000000
75%          2.000000     0.000000     0.000000        0.000000
max         33.000000    13.000000     5.000000        1.000000

loans.head()

Output:-

   credit.policy             purpose  int.rate  installment  log.annual.inc    dti
0              1  debt_consolidation    0.1189       829.10       11.350407  19.48
1              1         credit_card    0.1071       228.22       11.082143  14.29
2              1  debt_consolidation    0.1357       366.86       10.373491  11.63
3              1  debt_consolidation    0.1008       162.34       11.350407   8.10
4              1         credit_card    0.1426       102.92       11.299732  14.97

   fico  days.with.cr.line  revol.bal  revol.util
0   737        5639.958333      28854        52.1
1   707        2760.000000      33623        76.7
2   682        4710.000000       3511        25.6
3   712        2699.958333      33667        73.2
4   667        4066.000000       4740        39.5

   inq.last.6mths  delinq.2yrs  pub.rec  not.fully.paid
0               0            0        0               0
1               0            0        0               0
2               1            0        0               0
3               1            0        0               0
4               0            1        0               0

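# Data preparation
# Categorical features: 'purpose' holds text values, so convert it to dummy (one-hot) columns.
# drop_first=True drops one category to avoid a redundant column.
cat_feats = ['purpose']
final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
final_data.info()

Output:-
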
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
not.fully.paid                9578 non-null int64
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9578 non-null uint8
purpose_major_purchase        9578 non-null uint8
purpose_small_business        9578 non-null uint8
dtypes: float64(6), int64(7), uint8(6)
memory usage: 1.0 MB

# Train/test split
from sklearn.model_selection import train_test_split

# Features are every column except the target, not.fully.paid
X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

# Train the random forest model with 600 trees
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train)

# Predict on the test set
predictions = rfc.predict(X_test)

# Create the classification report
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))

Output:-

             precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.58      0.02      0.05       443

    accuracy                           0.85      2874
   macro avg       0.71      0.51      0.48      2874
weighted avg       0.81      0.85      0.78      2874


print(confusion_matrix(y_test, predictions))

Output:-

[[2423    8]
 [ 432   11]]
