Cause and effect: Regression Analysis

Cause and effect

The correlation theory of statistics is used to study cause and effect among variables. Causation implies correlation, but correlation alone does not imply causation: a correlation between two variables cannot by itself prove a cause and effect relationship. Correlation is only a mathematical relationship, and it does not necessarily signify a cause and effect relationship between the variables. For any two correlated variables A and B, the following relationships are possible:
  1. A causes B
  2. B causes A
  3. Both A and B are affected by a common cause (a third variable) but do not cause each other (see the simulation sketch after this list)
  4. A and B mutually affect each other, so that neither can be designated as the cause or the effect
  5. There is no connection between A and B; the correlation is due to random or chance factors
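
To make relationship 3 concrete, here is a minimal simulation sketch in Python (the variable names, coefficients, and sample size are illustrative assumptions, not from the original post): a hidden common cause C drives both A and B, so A and B correlate strongly even though neither causes the other.

import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=1000)             # hidden common cause
A = 2.0 * C + rng.normal(size=1000)   # A depends on C, not on B
B = -1.5 * C + rng.normal(size=1000)  # B depends on C, not on A

# Strongly negative correlation, yet neither A nor B causes the other
print("correlation(A, B) =", round(np.corrcoef(A, B)[0, 1], 2))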
  

Simple Linear Regression

Regression analysis establishes a statistical model of the relationship between a response variable and explanatory variables. The following are the steps in this process.
Step 1: Choose predictor (explanatory) variables. Correlation analysis is done to choose the predictors: a pair of variables with a higher coefficient of correlation (in absolute value) indicates a stronger relationship. The coefficient of correlation varies between -1.0 and 1.0.


Many times, time is also one of the variables, but it may just be a proxy for other variables. Time should be considered as a predictor variable when there is a seasonal effect to be accounted for.
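
As a sketch of Step 1, the coefficient of correlation between the response and each candidate predictor could be computed as below (the data arrays are hypothetical, chosen only to illustrate a strong versus a weak predictor):

import numpy as np

# Hypothetical data: response y and two candidate predictors
y  = np.array([10.0, 12.1, 13.9, 16.2, 18.0])
x1 = np.array([ 1.0,  2.0,  3.0,  4.0,  5.0])  # strongly related to y
x2 = np.array([ 3.0,  1.0,  4.0,  1.0,  5.0])  # weakly related to y

# Pearson coefficient of correlation, varies between -1.0 and 1.0
print("corr(y, x1) =", round(np.corrcoef(y, x1)[0, 1], 3))
print("corr(y, x2) =", round(np.corrcoef(y, x2)[0, 1], 3))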

Step 2: Regression Analysis -> confidence level & prediction
Define a regression model between the variables. The simple linear regression model predicts the response variable by the following equation:

Yi = m * Xi + c
[c = intercept, m = slope (sensitivity of Y to Xi)]
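
A minimal sketch of fitting this equation by least squares, assuming NumPy (the data values are made up for illustration; polyfit is one of several ways to estimate m and c):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # explanatory variable Xi
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # observed response Yi

m, c = np.polyfit(x, y, deg=1)  # least-squares fit of y = m*x + c
y_pred = m * x + c              # predicted values from the model
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")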

The difference between the observed value and the value predicted by the regression model is termed the residual.
residual = actual value - predicted value of the variable
standard residual = (residual - mean of residuals) / standard deviation of residuals
SST = sum of squared differences between the observed values and their mean (total sum of squares)
SSR = sum of squared differences between the predicted values and the mean of the observed values (regression sum of squares)
SSE = sum of squared differences between the observed and predicted values (error sum of squares)
SST = SSR + SSE
Note: the ratio SSR / SST (defined below as R2) varies between 0 and 1.
The standard error of the estimate is similar to the standard deviation, but it measures variation around the prediction line (not around the mean value); it is the square root of SSE divided by its degrees of freedom (n - 2 for simple linear regression).
R2 (coefficient of determination) is the proportion of variation in Y explained by variation in the explanatory variable(s) through the regression relation.
R2 = SSR / SST
The square root of R2 gives the correlation between the actual and predicted values and is termed Multiple R.
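
Putting these quantities together, the following sketch (same hypothetical data as above) computes the residuals, SST, SSR, SSE, the standard error of the estimate, R2, and Multiple R, and confirms SST = SSR + SSE:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
m, c = np.polyfit(x, y, deg=1)
y_pred = m * x + c

residual = y - y_pred                  # actual - predicted
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_pred - y.mean()) ** 2) # regression sum of squares
sse = np.sum((y - y_pred) ** 2)        # error sum of squares

see = np.sqrt(sse / (len(x) - 2))      # standard error of estimate (n - 2 df)
r2 = ssr / sst                         # coefficient of determination
multiple_r = np.sqrt(r2)               # correlation of actual vs predicted
print(f"SST={sst:.3f}  SSR+SSE={ssr + sse:.3f}")  # the two should match
print(f"SEE={see:.3f}  R2={r2:.3f}  Multiple R={multiple_r:.3f}")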


Step 3: Summarize all the candidate models (across all predictor variables) in terms of three measures: adjusted R square, Durbin-Watson, and MAPE. Based on these, choose the model that predicts best.

Adjusted R2: a modified version of R2 that penalizes a model for including redundant explanatory variables. It is the proportion of variation in Y explained by variation in X, adjusted for the number of predictors, and is a measure of the goodness of the model. A value closer to 1 means a tighter prediction of the variable by the model. Adjusted R2 is used more often than R2.
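
The usual adjustment, a standard textbook formula rather than one given in this post, penalizes R2 using the number of predictors p and the number of observations n:

def adjusted_r2(r2, n, p):
    # Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    # Penalizes R2 for each added predictor; assumes n > p + 1.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(r2=0.90, n=30, p=3))  # hypothetical values -> about 0.888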

Durbin-Watson Statistics
If you plot residuals against predicted values, the plot should look random. For time-series data, the Durbin-Watson statistic is used to evaluate the randomness of the residuals. DW is the sum of squared differences between successive residuals divided by the sum of squared residuals. DW varies between 0 and 4; a value between 1.5 and 2.5 indicates the errors are serially uncorrelated.
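
A direct sketch of this statistic (the residual values below are hypothetical):

import numpy as np

def durbin_watson(residuals):
    # DW = sum of squared differences of successive residuals
    #      divided by the sum of squared residuals; ranges from 0 to 4.
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1]))  # hypothetical residuals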

Absolute percentage error = |Actual - Predicted| / Actual value of the variable

Mean absolute percentage error (MAPE) is the average of the absolute percentage errors. It is used to measure the predictive ability of a model.
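
A sketch of MAPE on hypothetical actual and predicted values:

import numpy as np

def mape(actual, predicted):
    # Mean of |actual - predicted| / actual, expressed as a percentage;
    # assumes no actual value is zero.
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((a - p) / a)) * 100.0

print(mape([100, 120, 150], [110, 115, 160]))  # about 6.9 (percent)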


Additional read:
Dummy variable correlation analysis: http://en.wikipedia.org/wiki/Dummy_variable_(statistics)
MIT OpenCourseWare lecture notes: http://ocw.mit.edu/courses/sloan-school-of-management/15-075j-statistical-thinking-and-data-analysis-fall-2011/lecture-notes/
