Procedure and interpretation of linear regression analysis using STATA

Linear regression analysis is conducted to predict the dependent variable based on one or more independent variables. The basic regression equation is:

Y = β0 + β1X1 + β2X2 + e

Where:

  • Y is the dependent variable
  • X1 and X2 are the independent variables
  • β0 is the constant (intercept) and β1, β2 are the regression coefficients
  • e is the error term

In the above regression equation, β1 measures the effect of X1 on Y and, similarly, β2 measures the effect of X2 on Y. The constant term β0 is the value of Y when both X1 and X2 are zero. The error term e captures all other factors that affect Y apart from X1 and X2.

For example, if a researcher wants to determine the impact of “study time” on the overall “score” of a student, then score is the dependent variable and study time is the independent variable.

In STATA, we will use the same example that was used for correlation analysis and determine the influence of mileage (mpg) and repair record (rep78), the independent variables, on the price of the vehicle, the dependent variable. In order to run the test, go to:

Statistics > Linear models and related > Linear Regression

Using the drop-down option for linear regression in STATA

Further, it will redirect you to a new window, wherein you can select the dependent and independent variables and click “OK” to proceed.

Selecting the dependent and independent variables

Another way to run linear regression in STATA is to type the command in the command window. To run the linear regression, the following command can be used:

regress price mpg rep78

Here price is the dependent variable and mpg and rep78 are the independent variables.

The results obtained from the regression analysis are presented below:

STATA results for linear regression analysis

On the basis of the above results the regression equation can be written as:

price = 9657.754 - 271.6425 mpg + 666.9668 rep78
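As a quick check, the fitted equation above can be used to predict a price for any combination of mpg and rep78. The sketch below plugs in an illustrative mileage of 20 and repair record of 3 (hypothetical values chosen only for demonstration):

```python
# Coefficients taken from the STATA output above
intercept = 9657.754
b_mpg = -271.6425
b_rep78 = 666.9668

def predict_price(mpg, rep78):
    """Predicted price from the fitted regression equation."""
    return intercept + b_mpg * mpg + b_rep78 * rep78

# Hypothetical vehicle: 20 mpg, repair record of 3
print(round(predict_price(20, 3), 2))  # 6225.80
```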

The results from the above table can be interpreted as follows:

Source: It shows the variance in the dependent variable due to variables included in the regression (model) and variables not included (residuals). The “total” is the sum of model and residual value.

df: It stands for degrees of freedom, associated with each source of variance. The df for the model is the total number of regression coefficients estimated minus 1. In the above results, since there are 3 coefficients in total (including the constant), the df for the model is 3 − 1 = 2. The df of the residual is the total df minus the df of the model. In this case the total df is 68, calculated as (n − 1), so the df for the residual is 68 − 2, which is 66.

MS: Here MS stands for mean squares. This is calculated by dividing the sum of squares (SS) by the corresponding df.
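To make the SS-to-MS relationship concrete, the sketch below reconstructs the sums of squares from the mean squares and df reported in the output (SS = MS × df), then recovers the MS by dividing back:

```python
# Mean squares and df taken from the regression output above
ms_model, df_model = 72377031.7, 2
ms_resid, df_resid = 6546104.48, 66

# Sums of squares reconstructed as MS * df
ss_model = ms_model * df_model
ss_resid = ms_resid * df_resid

# MS is SS divided by its df
assert ss_model / df_model == ms_model
assert ss_resid / df_resid == ms_resid
```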

Checking the overall fitness of the model

Number of obs: Total number of observations used in the regression model.

F (2, 66): This is the F statistic, which is calculated by dividing the mean square of the model by the mean square of the residual. In this case 11.06 is obtained by dividing 72377031.7 by 6546104.48. The values in the brackets are the df of the model and the residual.
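The division described above can be reproduced directly from the two mean squares in the output:

```python
ms_model = 72377031.7   # mean square of the model
ms_resid = 6546104.48   # mean square of the residual

f_stat = ms_model / ms_resid
print(round(f_stat, 2))  # 11.06
```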

Prob > F: This is the significance value of the F statistic, which tests the null hypothesis that all the regression coefficients in the model are zero against the alternative hypothesis that at least one of the coefficients is non-zero. If this value is less than 0.05, one can reject the null hypothesis at the 95% confidence level.

In the above results, since this value is less than 0.05, at least one of the two coefficients is non-zero.

R-squared: This value shows how much of the variance in the dependent variable is explained by the independent variables included in the model. On the basis of the R-squared value, the overall strength of the relationship between the independent variables and the dependent variable can be measured. However, it does not show the association of each individual independent variable with the dependent variable.

In the above case the R-squared is 0.25, which shows that the two independent variables included in the model explain 25% of the variation in the dependent variable.
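R-squared can be reproduced from the same ANOVA quantities: it is the model sum of squares divided by the total sum of squares. In this sketch the SS values are reconstructed from the MS and df reported in the output (SS = MS × df):

```python
ss_model = 72377031.7 * 2     # model MS * model df
ss_resid = 6546104.48 * 66    # residual MS * residual df
ss_total = ss_model + ss_resid

r_squared = ss_model / ss_total
print(round(r_squared, 2))  # 0.25
```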

Adj R-squared: This is the R-squared adjusted for the number of independent variables in the regression model. It can also be used to assess the goodness of fit of the model. R-squared can always be improved by adding more independent variables, but adjusted R-squared increases only if the added variable explains more than would be expected by chance, since it penalizes the model for each additional independent variable.

In the above results the adjusted R-squared is 0.22, which is less than the R-squared value, because it has been adjusted for the number of independent variables in the model relative to their explanatory power.
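The adjustment follows the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. Using the rounded R-squared of 0.25 with n = 69 and k = 2 gives roughly 0.227; STATA computes it from the unrounded R-squared, so the printed value differs slightly:

```python
r_squared = 0.25  # rounded value from the output above
n, k = 69, 2      # observations and independent variables

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(adj_r_squared)  # approximately 0.227 with the rounded R-squared
```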

Root MSE: It is the square root of Mean Square of Residual. In other words Root MSE is the standard deviation of the error term.
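Taking the square root of the residual mean square from the output reproduces the reported Root MSE:

```python
import math

ms_resid = 6546104.48  # mean square of the residual from the output
root_mse = math.sqrt(ms_resid)
print(round(root_mse, 1))  # 2558.5
```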

Interpreting the regression coefficients

The above components of the regression results are the measure of overall fit of the regression model. Now this section will discuss the interpretation of the coefficients.

mpg: The coefficient of mpg is -271.64.

Interpretation: With one unit increase in mileage (mpg), the price of the auto declines by 271.64 units, holding all other factors constant.

rep78: Similarly, the coefficient of rep78 is 666.96.

Interpretation: With one unit increase in rep78, the price of the auto increases by 666.96 units, holding all other factors constant.

Determining the statistical significance of the regression coefficients

The coefficients of mpg and rep78 show a negative and a positive impact, respectively, on the price of the auto. However, to examine whether the impact is statistically significant or not, one needs to analyze the following parameters:

Std err: It is the standard error of the regression coefficient, which measures the variability in the estimated coefficients.

t: It tests whether the coefficient of a particular independent variable is significantly different from zero. It is calculated by dividing the coefficient by its standard error.
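The calculation is a single division. The sketch below uses the mpg coefficient from the output together with an illustrative standard error of 57.77, which is a hypothetical value assumed here for demonstration, not taken from the article's table:

```python
coef_mpg = -271.6425  # coefficient from the output above
se_mpg = 57.77        # hypothetical standard error, for illustration only

t_stat = coef_mpg / se_mpg
print(round(t_stat, 2))  # approximately -4.70
```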

P > |t|: It shows whether the coefficient has a statistically significant impact on the dependent variable or not. If the p-value is 0.05 or less, the coefficient is said to be statistically significant. In other words, if the p-value is 0.05, we are 95% confident that the independent variable has some effect on the dependent variable.

In the above results the p-value for mpg is reported as 0.000 (i.e. less than 0.001) and for rep78 it is 0.056. So mpg has a significant, negative impact on price. However, rep78 does not have a significant impact on price, as its p-value is greater than 0.05.

95% conf. interval: This shows that we are 95% confident that the true coefficient falls in this interval. If the interval does not contain 0, the p-value will be 0.05 or less.
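The link between the interval and significance can be checked directly: a coefficient is not significant at the 5% level when its 95% interval straddles zero. The sketch below builds an approximate interval (coefficient ± 1.96 × SE) for rep78 using a hypothetical standard error of 342, assumed here only for illustration; consistent with the p-value of 0.056 reported above, the interval contains zero:

```python
coef = 666.9668  # rep78 coefficient from the output
se = 342.0       # hypothetical standard error, for illustration only

lower = coef - 1.96 * se
upper = coef + 1.96 * se

# If zero lies inside the interval, the coefficient is not
# significant at the 5% level
print(lower < 0 < upper)  # True
```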

In this article we discussed linear regression. In the next article, I will discuss different types of regression analysis, i.e. log-linear regression, linear-log regression and log-log regression.

Indra Giri


Senior Analyst at Project Guru
He completed his Masters in Development Economics from South Asian University, New Delhi. His areas of interest include various socio-development issues like poverty, inequality and unemployment in South Asia. Apart from writing for Project Guru he loves to travel and play football in his spare time.
