Cause and effect: Regression Analysis
Cause and effect
Correlation analysis is often used as a first step in studying cause and effect among variables. Causation implies correlation, but correlation alone does not imply causation: a correlation between two variables cannot by itself prove a causal effect. Correlation is only a mathematical relationship and does not necessarily signify a cause-and-effect relationship between the variables. For any two correlated variables A and B, the following relationships are possible:
- A causes B
- B causes A
- Both A and B are affected by a common cause (third variable) but do not cause each other
- Both A and B mutually affect each other, so that neither can be designated as the cause or the effect
- There is no connection between A and B; the correlation is due to random or chance factors
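For instance, a quick simulation illustrates the third-variable case above: a common cause can induce a strong correlation between two variables that do not affect each other at all. This is a minimal sketch in Python with NumPy; the variables and coefficients are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Z is a common cause; A and B each depend on Z but not on each other.
Z = rng.normal(size=1000)
A = 2.0 * Z + rng.normal(size=1000)
B = -1.5 * Z + rng.normal(size=1000)

# A and B come out strongly correlated even though neither causes the other.
print(np.corrcoef(A, B)[0, 1])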
Simple Linear Regression
Regression analysis establishes a statistical model of the relationship between a variable and its explanatory variables. The steps in this process are as follows.
Step 1: Choose the predictor (explanatory) variables. Correlation analysis is done to choose the predictors: a pair of variables with a higher coefficient of correlation indicates a stronger relationship. The coefficient of correlation varies between -1.0 and 1.0.
Often, time is also one of the variables, but it may just be a proxy for other variables. Time should be considered a predictor variable only when there is a seasonal effect to be accounted for.
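As a sketch of this step (hypothetical data, using Python with NumPy), the coefficient of correlation of each candidate predictor with the response can be computed and compared:

import numpy as np

# Hypothetical observations: response y and two candidate predictors.
y  = np.array([10.0, 12.1, 13.9, 16.2, 18.1])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 3.0, 6.0, 2.0, 4.0])

# Coefficient of correlation of each candidate with y (ranges -1.0 to 1.0).
for name, x in [("x1", x1), ("x2", x2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(name, round(r, 3))
# The candidate with the larger |r| indicates the stronger linear relation.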
Step 2: Regression analysis -> confidence level and prediction
Define a regression model between the variables. The simple linear regression model predicts the variable by the following equation:
Yi = m Xi + c
[c = intercept, m = slope (sensitivity of Y to Xi)]
The difference between the observed value and the value predicted by the regression model is termed the residual.
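A minimal sketch of fitting this equation by least squares (Python with NumPy, hypothetical data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of slope m and intercept c in Yi = m*Xi + c.
m, c = np.polyfit(x, y, deg=1)

predicted = m * x + c
residuals = y - predicted          # residual = actual - predicted
print("m =", m, "c =", c)
print("residuals:", residuals)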
residual = actual - predicted value of the variable
standardized residual = (residual - mean of residuals) / standard deviation of residuals
SST = sum of squared differences between observed value and mean value
SSR = sum of squared differences between predicted value and mean value
SSE = sum of squared differences between observed value and predicted value
SST = SSR + SSE
Note: the ratio SSR / SST varies between 0 and 1.
The standard error of the estimate is similar to a standard deviation, but it measures variation around the prediction line (not around the mean value): standard error of estimate = square root of (SSE / (n - 2)) for simple linear regression, where n is the number of observations.
R2 (coefficient of determination) is the proportion of variation in Y explained by variation in the explanatory variable(s) through the regression relation.
R2 = SSR / SST
The square root of R2 gives the correlation between the actual and predicted values and is termed Multiple R.
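Continuing the earlier sketch, all of these quantities follow directly from the fitted values (hypothetical data as before):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m, c = np.polyfit(x, y, deg=1)
predicted = m * x + c

sst = np.sum((y - y.mean()) ** 2)          # total variation
ssr = np.sum((predicted - y.mean()) ** 2)  # variation explained by regression
sse = np.sum((y - predicted) ** 2)         # unexplained (residual) variation

r2 = ssr / sst                             # coefficient of determination
multiple_r = np.sqrt(r2)                   # correlation of actual vs predicted
std_error = np.sqrt(sse / (len(y) - 2))    # standard error of the estimate
print(r2, multiple_r, std_error)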
Step 3: Summarize all the candidate models (across all predictor variables) in terms of three measures: adjusted R-square, Durbin-Watson, and MAPE. Based on these, choose the model that predicts best.

Adjusted R2: a modified version of R2 that penalizes a model for including redundant explanatory variables. It is the proportion of variation in Y explained by variation in X, adjusted for the number of predictors, and is a measure of the goodness of the model. A value closer to 1 means a tighter prediction of the variable by the model. Adjusted R2 is used more often than R2.
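A sketch of the adjustment, using the standard textbook formula with n observations and k explanatory variables:

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k explanatory variables.

    Penalizes R-squared for each extra predictor, so a redundant
    variable lowers the score even if it raises plain R-squared.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Example: R2 = 0.95 looks less impressive once 3 predictors on 10 points
# are penalized.
print(adjusted_r2(0.95, n=10, k=3))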
Durbin-Watson Statistic
If you plot the residuals against the predicted values, the pattern should be random. For time-series data, the Durbin-Watson statistic is used to evaluate the randomness of the residuals. DW is the sum of squared differences between successive residuals divided by the sum of squared residuals. DW varies between 0 and 4; a value between 1.5 and 2.5 indicates that the errors are serially uncorrelated.
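A minimal sketch of the statistic (Python with NumPy, hypothetical residuals):

import numpy as np

def durbin_watson(residuals):
    # Sum of squared differences between successive residuals,
    # divided by the sum of squared residuals; ranges from 0 to 4.
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

residuals = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1])
print(durbin_watson(residuals))   # a value near 2 suggests no serial correlation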
Absolute percentage error = |Actual - Predicted| / Actual value of the variable
Mean absolute percentage error (MAPE) is the average of the absolute percentage errors. It is used to measure the predictive ability of a model.
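A sketch of both error measures (hypothetical actual and predicted values):

import numpy as np

actual = np.array([100.0, 120.0, 90.0, 110.0])
predicted = np.array([98.0, 125.0, 95.0, 105.0])

# Absolute percentage error per observation, then its mean (MAPE).
ape = np.abs(actual - predicted) / actual
mape = ape.mean()
print("MAPE = {:.2%}".format(mape))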
Dummy variable correlation analysis: http://en.wikipedia.org/wiki/Dummy_variable_(statistics)
Link from MIT OpenCourseWare: http://ocw.mit.edu/courses/sloan-school-of-management/15-075j-statistical-thinking-and-data-analysis-fall-2011/lecture-notes/