Assumptions Of Linear Regression Algorithm

Introduction To Assumptions Of Linear Regression

To understand the assumptions of linear regression, we first need to understand what linear regression is. Linear regression is a statistical method to regress data with a dependent variable having continuous values, whereas the independent variables can have either continuous or categorical values. In other words, linear regression is a method to predict the dependent variable (Y) based on the values of the independent variables (X). It can be used in cases where we want to predict some continuous quantity.
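As a quick illustration, here is a minimal sketch on made-up data using the statsmodels library that the rest of this post relies on:

import numpy as np
import statsmodels.api as sm

# Made-up data: y depends linearly on x plus random noise
x = np.arange(20)
y = 5.0 + 3.0 * x + np.random.normal(size=20)

X_const = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X_const).fit()    # fit by ordinary least squares
print(model.params)                 # estimated intercept and slope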

There are five assumptions of linear regression:

  1. Linear Relationship
  2. No Autocorrelation
  3. Multivariate Normality
  4. Homoscedasticity 
  5. No or low Multicollinearity 

Linear Relationship 

First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots. 

The graph shows that as the money invested in TV advertising increases, sales also increase linearly, which means there is a linear relationship between TV advertising and sales.

Test to check linear relationship

Graphical Method.

First:- Plot the graph of predicted values against actual values.

Second:- Plot the graph of predicted values against residual values. Both plots are sketched below.
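For example, both plots can be drawn with seaborn (a sketch assuming the fitted statsmodels model lin_reg, the constant-augmented inputs X_constant, and the target y from the walkthrough later in this post):

import seaborn as sns
import matplotlib.pyplot as plt

# Predicted vs actual: the points should lie close to a straight line
sns.regplot(x=lin_reg.predict(X_constant), y=y)
plt.show()

# Predicted vs residual: the points should scatter randomly around zero
sns.regplot(x=lin_reg.predict(X_constant), y=lin_reg.resid)
plt.show()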

Statistical Test

Rainbow test:- This test is used to check the linearity of the fitted regression. Here we define two hypothesis statements.

Null hypothesis – The regression is linear

Alternative hypothesis – The regression is not linear

With the help of the statsmodels.api library, we check the linearity of the regression.

Example:- 

import statsmodels.api as sm
sm.stats.diagnostic.linear_rainbow(res=lin_reg) 

It gives us the p-value, which is then compared to the significance level (α), which is 0.05. If the p-value is greater than the significance level, we fail to reject the null hypothesis, i.e. the regression is linear; if it is less, we reject the null hypothesis, i.e. the regression is not linear.
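For instance, the returned pair can be unpacked and compared against α = 0.05 (a sketch assuming the fitted model lin_reg from the walkthrough below):

import statsmodels.api as sm

# linear_rainbow returns (test statistic, p-value)
stat, p_value = sm.stats.diagnostic.linear_rainbow(res=lin_reg)
if p_value > 0.05:
    print("Fail to reject the null hypothesis: the regression is linear")
else:
    print("Reject the null hypothesis: the regression is not linear")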

Little or No autocorrelation

No or low autocorrelation is the second assumption of linear regression. Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other, in other words when the value of y(x+1) is not independent of the value of y(x).

If the values of a column or feature are correlated with other values of the same column, it is said to be autocorrelated; in other words, there is correlation within a column.

Test to check autocorrelation

Graphical method.

Plot the ACF plot of the residuals to check for autocorrelation in the data.

If the graph looks cyclic, the residuals contain positive autocorrelation; if the graph alternates in sign, the residuals contain negative autocorrelation.

The graph above shows positive autocorrelation because it looks like a cyclic graph.
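A sketch of such an ACF plot of the residuals (assuming the fitted model lin_reg from the walkthrough below):

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Autocorrelation function of the residuals at increasing lags
plot_acf(lin_reg.resid)
plt.show()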

Durbin-Watson(DW) Test:- 

The Durbin-Watson (DW) test is generally used to check for autocorrelation.

The Durbin-Watson test can be defined as follows:

The Durbin-Watson test statistic is approximately equal to 2*(1 - r), where r is the sample autocorrelation of the residuals of the model. Thus, for r = 0, indicating no serial correlation, the test statistic equals 2.

The Durbin-Watson statistic ranges from 0 to 4: values between 0 and 2 indicate positive autocorrelation, 2 means no autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
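The statistic can be computed directly from the residuals with statsmodels (a sketch assuming the fitted model lin_reg from the walkthrough below):

from statsmodels.stats.stattools import durbin_watson

# Values near 2 indicate no autocorrelation; toward 0 positive, toward 4 negative
dw = durbin_watson(lin_reg.resid)
print(dw)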

Multivariate Normality

Multivariate normality is the third assumption of linear regression. Linear regression analysis requires all variables to be multivariate normal, meaning the data should be normally distributed. As sample sizes increase, normality of the residuals becomes less important: if we take repeated samples from our population, then for large sample sizes the distribution (across repeated samples) of the ordinary least squares estimates of the regression coefficients follows a normal distribution. For moderate to large sample sizes, non-normality of the residuals should therefore not adversely affect the usual inferential procedures. This result is a consequence of an extremely important result in statistics known as the central limit theorem.

Test to check multivariate normality

To check normality we use the Q-Q plot, from which we can infer whether the data comes from a normal distribution. If the data is normally distributed, the plot shows a fairly straight line; if it is not normal, deviations from the straight line are seen.

Example Code is:-

import statsmodels.api as sm
sm.qqplot(lin_reg.resid, fit=True, line='45')   # '45' draws the 45-degree reference line

Jarque-Bera Test: This is a goodness-of-fit test of whether the sample data have skewness and kurtosis matching a normal distribution.

Here we define null and alternative hypotheses.

Null Hypothesis – Error terms are normally distributed.

Alternative hypothesis – Error terms are not normally distributed.

In this test, we compute the test statistic and its p-value for the residuals. The p-value is compared with the significance level (0.05), or equivalently the test statistic is compared with the chi-square critical value (5.99 at α = 0.05 with 2 degrees of freedom).

Code:-

from scipy import stats
stats.jarque_bera(lin_reg.resid)
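The returned pair is (test statistic, p-value), so the decision rule can be written as follows (a sketch assuming the fitted model lin_reg):

from scipy import stats

jb_stat, p_value = stats.jarque_bera(lin_reg.resid)
if p_value > 0.05:
    print("Fail to reject the null hypothesis: residuals are normally distributed")
else:
    print("Reject the null hypothesis: residuals are not normally distributed")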


 Homoscedasticity 

Homoscedasticity is the fourth assumption of linear regression. Homoscedasticity describes a situation in which the error term (the “noise” or random disturbance in the relationship between the independent variables and the target) is the same across all values of the independent variables. A scatter plot of residual values vs predicted values is a good way to check for homoscedasticity.

If the variance of the residual is symmetrically distributed across the residual line then data is said to be homoscedastic.

If the variance of the residuals is unequal across the residual line, the data is said to be heteroscedastic. In this case, the residuals can form a bow-tie, arrow, or any other non-symmetric shape.

Test to check homoscedasticity

Graphical method.

Draw a regplot of the predicted values against the residuals to check homoscedasticity.

Example:-
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x=lin_reg.predict(X_constant), y=lin_reg.resid)
plt.show()


Goldfeld-Quandt Test or Breusch-Pagan Test:- These tests are used to check homoscedasticity. Here we define the null and alternative hypotheses.
Null hypothesis:- The variance is constant across the range of the data (i.e. homoscedasticity)
Alternative hypothesis:- The variance is not constant across the data (i.e. heteroscedasticity)


Example Code:-
import statsmodels.stats.api as sms
sms.het_goldfeldquandt(lin_reg.resid, lin_reg.model.exog)   # exog contains all of the input variables

It gives us the p-value, which is then compared to the significance level (α), which is 0.05. If the p-value is greater than the significance level, we fail to reject the null hypothesis; if it is less, we reject the null hypothesis.



No or low Multicollinearity 

No or low Multicollinearity is the fifth assumption in assumptions of linear regression. It refers to a situation where a number of independent variables in a multiple regression model are closely correlated to one another. Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. 

Test to check multicollinearity

Correlation coefficients:- An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible.
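With pandas this is a one-liner (a sketch assuming the predictors are stored in a DataFrame x, as in the walkthrough below):

# Pairwise correlation coefficients of the predictor variables
corr_matrix = x.corr()
print(corr_matrix)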

Variance Inflation Factor (VIF):- Another method to check multicollinearity is the variance inflation factor, which is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone. It expresses multicollinearity as a ratio.

 VIF = 1/T

where T is the tolerance, which measures the influence of one independent variable on all other independent variables. The tolerance is calculated with an auxiliary regression analysis and is defined as T = 1 – R² of that regression. If the tolerance is less than 0.1 (T < 0.1) there might be multicollinearity in the data, and if it is less than 0.01 (T < 0.01) there certainly is.

If the VIF is 1, the data is not correlated; between 1 and 5, it is moderately correlated; and greater than 5, it is highly correlated.
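As an illustration of the formula, the VIF of a single predictor can be computed by hand from an auxiliary regression (a sketch using the TV, Radio, and Newspaper predictors of the Advertising dataset and the DataFrame x from the walkthrough below):

import statsmodels.api as sm

# Auxiliary regression of one predictor (TV) on the remaining predictors
others = sm.add_constant(x[['Radio', 'Newspaper']])
aux_reg = sm.OLS(x['TV'], others).fit()

tolerance = 1 - aux_reg.rsquared   # T = 1 - R squared
vif_tv = 1 / tolerance             # VIF = 1 / T
print(tolerance, vif_tv)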

Why is removing highly correlated features important?

The stronger the correlation between independent variables, the more difficult it is to change one feature without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the target variable independently, because the features tend to change in unison.


Widget not in any sidebars

How to remove multicollinearity in data?

Suppose we have two features that are highly correlated. We can either drop one of them and keep the other, or combine the two features into a new feature (see the sketch below).
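A minimal sketch, assuming a DataFrame x of predictors in which two hypothetical columns 'feature_a' and 'feature_b' are highly correlated:

# Option 1: drop one of the two correlated features
x_reduced = x.drop(columns=['feature_b'])

# Option 2: combine the two features into one new feature and drop the originals
x['feature_ab'] = x['feature_a'] + x['feature_b']   # hypothetical combined feature
x_reduced = x.drop(columns=['feature_a', 'feature_b'])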

Example Code:-
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=[variance_inflation_factor(X_constant.values,i) for i in range(X_constant.shape[1])] 
# To check the correlation, use the code below
sns.heatmap(ad_data.corr(), annot=True)

Implementation of the assumption tests using the statsmodels library.

We use the Advertising dataset to test the assumptions of linear regression. The dataset contains TV, Radio, and Newspaper advertising investments and the corresponding sales.

# Import required libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns   # For Visualization
import matplotlib.pyplot as plt
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
ad_data = pd.read_csv('Advertising.csv',index_col='Unnamed: 0')
ad_data.head()   # Check The Head Of Data
# Output
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
# Take the independent and dependent variables
x = ad_data.drop(["Sales"],axis=1)
y = ad_data.Sales
import statsmodels.api as sm    # import statsmodels to fit the OLS model

# Apply OLS Model 
X_constant = sm.add_constant(x)                     # Add a constant (intercept) column to the independent variables
lin_reg = sm.OLS(y, X_constant).fit()               # Fit the data with the OLS model
lin_reg.summary()                                   # Check the summary of the model
Output:-

Dep. Variable:      Sales              R-squared:           0.897
Model:              OLS                Adj. R-squared:      0.896
Method:             Least Squares      F-statistic:         570.3
Date:               Mon, 06 Jul 2020   Prob (F-statistic):  1.58e-96
Time:               21:44:42           Log-Likelihood:      -386.18
No. Observations:   200                AIC:                 780.4
Df Residuals:       196                BIC:                 793.6
Df Model:           3
Covariance Type:    nonrobust

              coef    std err        t      P>|t|     [0.025     0.975]
const        2.9389     0.312     9.422     0.000      2.324      3.554
TV           0.0458     0.001    32.809     0.000      0.043      0.049
Radio        0.1885     0.009    21.893     0.000      0.172      0.206
Newspaper   -0.0010     0.006    -0.177     0.860     -0.013      0.011

Omnibus:          60.414    Durbin-Watson:         2.084
Prob(Omnibus):     0.000    Jarque-Bera (JB):    151.241
Skew:             -1.327    Prob(JB):           1.44e-33
Kurtosis:          6.332    Cond. No.               454.
# The summary gives us all the statistical measures of the independent variables, the dependent variable, and the residuals of the model
#  Assumptions of Linear Regression
No Autocorrelation
Multivariate Normality
Linear Relationship
Homoscedasticity 
No or low Multicollinearity 
# Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf   # load the function to draw the ACF plot and check autocorrelation
acf = plot_acf(lin_reg.resid)    # pass the residuals into the ACF plot
plt.show()

# The value of the Durbin-Watson test is also close to 2 (DW = 2.084), so the residuals show no autocorrelation

# Multivariate Normality
# Draw a distribution plot to check normality
sns.distplot(lin_reg.resid)
plt.show()

# Q-Q plot to check normality
import statsmodels.api as sm
Q_Qplot = sm.qqplot(lin_reg.resid, fit=True)


# Check Normality Using Jarque Bera Test
from scipy import stats
stats.jarque_bera(lin_reg.resid)   # returns (test statistic, p-value)
Output:-
(151.2414204760376, 0.0)

# Here we reject the null hypothesis: the test statistic exceeds the critical value (151.24 > 5.99) and the p-value is below 0.05, so the residuals are not normally distributed

# Linear Relationship
# Plot predicted vs actual to check linearity
sns.regplot(x=lin_reg.predict(X_constant),y=y)
plt.show()

# If most of the points lie close to the fitted line, the relationship is linear
# Plot predicted vs residuals to check linearity
 
sns.regplot(x=lin_reg.predict(X_constant),y=lin_reg.resid)
plt.show()


# The residuals scatter around a roughly horizontal line with no clear pattern, so the linearity assumption holds
# Rainbow Test for Linearity

import statsmodels.api as sm

sm.stats.diagnostic.linear_rainbow(res=lin_reg) 
# The second value is the p-value
Output:- (0.8896886584728811, 0.7185004116483391)
# Here we fail to reject the null hypothesis because the p-value > 0.05, so the regression is linear
# Homoscedasticity test
# Draw a regplot of predicted vs residuals
sns.regplot(x=lin_reg.predict(X_constant),y=lin_reg.resid)

# The residuals are spread roughly evenly around the fitted line across the predicted values, so the data is homoscedastic
# Goldfeld-Quandt test

import statsmodels.stats.api as sms
sms.het_goldfeldquandt(lin_reg.resid, lin_reg.model.exog)   # exog contains all of the input variables
# The middle value is the p-value, which is > 0.05, hence we fail to reject the null hypothesis (the variance is constant)
# Multicollinearity

# Calculate VIF 
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=[variance_inflation_factor(X_constant.values,i) for i in range(X_constant.shape[1])] 

# X_constant.shape[1] gives the number of columns in X_constant,
# which is the number of times the loop runs;
# each column is then compared against all of the other columns.
vif
Output:-
from pandas import Series,DataFrame
df=DataFrame(vif,index=X_constant.columns,columns=['vif'])

df

Output:-
                 vif
const       6.848900
TV          1.004611
Radio       1.144952
Newspaper   1.145187
# Find the correlation of the independent features
plt.figure(figsize=(10,8))
sns.heatmap(ad_data.corr(),annot=True)
plt.show()

Each of the independent features is only weakly correlated with the others, so there is no multicollinearity in our data.



Conclusion:– In this blog, you got a better understanding of the assumptions of linear regression, which test is used to check each assumption, and how to run those tests statistically. This should give you more confidence to solve any type of regression problem.
