Procedure and interpretation of linear regression analysis using STATA

Linear regression analysis is conducted to predict the dependent variable based on one or more independent variables. The basic regression equation is:

Y = β0 + β1X1 + β2X2 + e

Where:

  • Y is the dependent variable
  • X1 and X2 are the independent variables
  • β0 is the constant (intercept) and β1, β2 are the regression coefficients
  • e is the error term

In the above regression equation, β1 measures the effect of X1 on Y and, similarly, β2 measures the effect of X2 on Y. The constant term β0 is the value of Y when both X1 and X2 are zero. The error term e captures all other factors that affect Y apart from X1 and X2.

For example, if a researcher wants to determine the impact of “study time” on the overall “score” of a student, then score is the dependent variable and study time is the independent variable.

In STATA, we will use the same example that was used for correlation analysis and determine the influence of mileage (mpg) and repair record (rep78), the independent variables, on the price of the vehicle, the dependent variable. In order to run the test, go to:

Statistics > Linear models and related > Linear Regression

Using the drop-down option for linear regression in STATA

Further, it will redirect you to a new window, wherein you can select the dependent and independent variables and click “OK” to proceed.

Selecting the dependent and independent variables

Another way to run linear regression in STATA is to type the command in the command window. To run the linear regression, the following command can be used:

regress price mpg rep78

Here price is the dependent variable and mpg and rep78 are the independent variables.

The results obtained from the regression analysis are presented below:

STATA results for linear regression analysis

On the basis of the above results the regression equation can be written as:

price = 9657.754 - 271.6425 mpg + 666.9668 rep78
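As a quick check, the fitted equation above can be used to predict a price for any combination of mpg and rep78. The sketch below plugs in an illustrative mileage of 20 and repair record of 3 (hypothetical values chosen only for demonstration):

```python
# Coefficients taken from the STATA output above
intercept = 9657.754
b_mpg = -271.6425
b_rep78 = 666.9668

def predict_price(mpg, rep78):
    """Predicted price from the fitted regression equation."""
    return intercept + b_mpg * mpg + b_rep78 * rep78

# Hypothetical vehicle: 20 mpg, repair record of 3
print(round(predict_price(20, 3), 2))  # 6225.80
```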

The results from the above table can be interpreted as follows:

Source: It shows the variance in the dependent variable due to variables included in the regression (model) and variables not included (residuals). The “total” is the sum of model and residual value.

df: It stands for degrees of freedom, associated with each source of variance. The df for the model is the total number of regression coefficients estimated minus 1. In the above results, since there are 3 coefficients in total (including the constant), the df for the model is 3 − 1 = 2. The df of the residual is the total df minus the df of the model. In this case the total df is 68, calculated as (n − 1), so the df for the residual is 68 − 2, which is 66.

MS: Here MS stands for mean squares. This is calculated by dividing the sum of squares (SS) by the corresponding df.
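To make the SS-to-MS relationship concrete, the sketch below reconstructs the sums of squares from the mean squares and df reported in the output (SS = MS × df), then recovers the MS by dividing back:

```python
# Mean squares and df taken from the regression output above
ms_model, df_model = 72377031.7, 2
ms_resid, df_resid = 6546104.48, 66

# Sums of squares reconstructed as MS * df
ss_model = ms_model * df_model
ss_resid = ms_resid * df_resid

# MS is SS divided by its df
assert ss_model / df_model == ms_model
assert ss_resid / df_resid == ms_resid
```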

Checking the overall fitness of the model

Number of obs: Total number of observations used in the regression model.

F (2, 66): This is the F statistic, which is calculated by dividing the mean square of the model by the mean square of the residual. In this case 11.06 is obtained by dividing 72377031.7 by 6546104.48. The values in the brackets are the df of the model and the residual.
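The division described above can be reproduced directly from the two mean squares in the output:

```python
ms_model = 72377031.7   # mean square of the model
ms_resid = 6546104.48   # mean square of the residual

f_stat = ms_model / ms_resid
print(round(f_stat, 2))  # 11.06
```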

Prob > F: This is the significance value of the F statistic, which tests the null hypothesis that all the regression coefficients in the model are zero against the alternative hypothesis that at least one of the coefficients is non-zero. If this value is less than 0.05, one can reject the null hypothesis at the 95% confidence level.

In the above results, since this value is less than 0.05, at least one of the two coefficients is non-zero.

R-squared: This value shows how much of the variance in the dependent variable is explained by the independent variables included in the model. On the basis of the R-squared value, the overall strength of the relationship between the independent variables and the dependent variable can be measured. However, it does not show the association of each individual independent variable with the dependent variable.

In the above case the R-squared is 0.25, which shows that the two independent variables included in the model explain 25% of the variation in the dependent variable.
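R-squared can be reproduced from the same ANOVA quantities: it is the model sum of squares divided by the total sum of squares. In this sketch the SS values are reconstructed from the MS and df reported in the output (SS = MS × df):

```python
ss_model = 72377031.7 * 2     # model MS * model df
ss_resid = 6546104.48 * 66    # residual MS * residual df
ss_total = ss_model + ss_resid

r_squared = ss_model / ss_total
print(round(r_squared, 2))  # 0.25
```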

Adj R-squared: This is the R-squared adjusted for the number of independent variables in the regression model. It can also be used to assess the goodness of fit of the model. R-squared can always be improved by adding more independent variables, but adjusted R-squared increases only if the added variable explains more than would be expected by chance, since it penalizes the model for each additional independent variable.

In the above results the adjusted R-squared is 0.22, which is less than the R-squared value, because it has been adjusted for the number of independent variables in the model relative to their explanatory power.
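The adjustment follows the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. Using the rounded R-squared of 0.25 with n = 69 and k = 2 gives roughly 0.227; STATA computes it from the unrounded R-squared, so the printed value differs slightly:

```python
r_squared = 0.25  # rounded value from the output above
n, k = 69, 2      # observations and independent variables

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(adj_r_squared)  # approximately 0.227 with the rounded R-squared
```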

Root MSE: It is the square root of Mean Square of Residual. In other words Root MSE is the standard deviation of the error term.
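Taking the square root of the residual mean square from the output reproduces the reported Root MSE:

```python
import math

ms_resid = 6546104.48  # mean square of the residual from the output
root_mse = math.sqrt(ms_resid)
print(round(root_mse, 1))  # 2558.5
```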

Interpreting the regression coefficients

The above components of the regression results are the measure of overall fit of the regression model. Now this section will discuss the interpretation of the coefficients.

mpg: The coefficient of mpg is -271.64.

Interpretation: With one unit increase in mileage (mpg), the price of the auto declines by 271.64 units, holding all other factors constant.

rep78: Similarly, the coefficient of rep78 is 666.96.

Interpretation: With one unit increase in rep78, the price of the auto increases by 666.96 units, holding all other factors constant.

Determining the statistical significance of the regression coefficients

The coefficients of mpg and rep78 show a negative and a positive impact, respectively, on the price of the auto. However, to examine whether the impact is statistically significant or not, one needs to analyze the following parameters:

Std err: It is the standard error of the regression coefficient, which measures the variability in the estimated coefficients.

t: It tests whether the coefficient of a particular independent variable is significantly different from zero. It is calculated by dividing the coefficient by its standard error.
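The calculation is a single division. The sketch below uses the mpg coefficient from the output together with an illustrative standard error of 57.77, which is a hypothetical value assumed here for demonstration, not taken from the article's table:

```python
coef_mpg = -271.6425  # coefficient from the output above
se_mpg = 57.77        # hypothetical standard error, for illustration only

t_stat = coef_mpg / se_mpg
print(round(t_stat, 2))  # approximately -4.70
```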

P > |t|: It shows whether the coefficient has a statistically significant impact on the dependent variable or not. If the p-value is 0.05 or less, the coefficient is said to be statistically significant. In other words, if the p-value is 0.05, we are 95% confident that the independent variable has some effect on the dependent variable.

In the above results the p-value for mpg is reported as 0.000 (i.e. less than 0.001) and for rep78 it is 0.056. So mpg has a significant, negative impact on price. However, rep78 does not have a significant impact on price, as its p-value is greater than 0.05.

95% conf. interval: This shows that we are 95% confident that the true coefficient falls in this interval. If the interval does not contain 0, the p-value will be 0.05 or less.
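The link between the interval and significance can be checked directly: a coefficient is not significant at the 5% level when its 95% interval straddles zero. The sketch below builds an approximate interval (coefficient ± 1.96 × SE) for rep78 using a hypothetical standard error of 342, assumed here only for illustration; consistent with the p-value of 0.056 reported above, the interval contains zero:

```python
coef = 666.9668  # rep78 coefficient from the output
se = 342.0       # hypothetical standard error, for illustration only

lower = coef - 1.96 * se
upper = coef + 1.96 * se

# If zero lies inside the interval, the coefficient is not
# significant at the 5% level
print(lower < 0 < upper)  # True
```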

In this article we discussed linear regression. In the next article, I will discuss different types of regression analysis, i.e. log-linear regression, linear-log regression and log-log regression.

Indra Giri


Senior Analyst at Project Guru
He completed his Masters in Development Economics from South Asian University, New Delhi. His areas of interest include various socio-development issues like poverty, inequality and unemployment in South Asia. Apart from writing for Project Guru he loves to travel and play football in his spare time.
