# Linear regression analysis using SPSS

Linear regression analysis is used to determine the relationship between a dependent variable and a set of independent variables. The procedure shows whether the independent variables influence the dependent variable and to what extent. For example, a waiter’s income (the dependent variable) may depend on several independent variables such as age, total bill amount and tips.

## Linear regression with SPSS

**Step 1:** From the menu, choose Analyze -> Regression -> Linear, as shown in Figure 1.

**Step 2:** This opens the Linear Regression dialogue box (Figure 2). Select Household Income in thousands and move it to the Dependent box. Next, select the independent variables (Age, Number of people in the household and Years with current employer) and move them to the Independent(s) box. Click OK to run the analysis.
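For readers who want to see what happens behind the dialogue box, here is a minimal Python sketch of an "Enter" style fit, that is, ordinary least squares with all predictors entered at once. It is not SPSS output. The six rows of data, the rule used to generate income, and the function names `solve_linear_system` and `fit_ols` are all made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return [constant, b1, ..., bk] minimising the sum of squared residuals."""
    x = [[1.0] + list(r) for r in rows]  # the leading 1.0 is the constant term
    k = len(x[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: (age, people in household, years with current employer).
predictors = [(25, 2, 1), (32, 4, 5), (41, 3, 12), (47, 5, 20), (53, 2, 25), (38, 1, 8)]
# Income is generated from a known rule so the fit can be checked by eye.
income = [10 + 0.5 * a + 2 * p + 1.5 * yrs for a, p, yrs in predictors]

coefficients = fit_ols(predictors, income)
print([round(c, 3) for c in coefficients])
```

Because the made-up income follows the rule exactly, the fit recovers the constant 10 and the slopes 0.5, 2 and 1.5. Real survey data would leave residuals, which is what the tables discussed next summarise.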

**Inferences:** The model fit output consists of the “Model Summary” table (Table 1) and the ANOVA table (Table 2). The Model Summary reports the multiple correlation coefficient R, its square R², and the adjusted R² as summary measures of model fit. Here R = 0.799 indicates a strong correlation between the dependent variable and the independent variables taken together (the closer R is to 1.000, the stronger the correlation). R² = 0.638 means that 63.8% of the variability in household income in the sample is explained by the model. The adjusted R², which corrects for the number of predictors, gives a revised estimate: 60.8% of the variability in household income is explained by the three independent variables (Years with Current Employer, Age, and Number of People in the Household).

Further, the Standard Error of the Estimate in Table 1 is 12.021. It is the standard deviation of the residuals, i.e. the typical distance between observed and predicted household income, and it is small considering that the average household income ranges from 5000 to 25000 Rs.

Table 1: Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .799 | .638 | .608 | 12.021 |

### Legends

- **R:** The multiple correlation coefficient, which tells us how strongly the independent variables, taken together, are related to the dependent variable.
- **R Square:** Indicates how much of the total variation in the dependent variable is explained by the independent variables.
- **Adjusted R Square:** R Square adjusted for the number of independent variables in the model, which gives a more conservative estimate of fit.
- **Std. Error of the Estimate:** The standard deviation of the error term (the residuals).
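The four quantities in the legend can be computed directly from observed and fitted values. The sketch below assumes the fitted values come from a least-squares model with a constant. The function name `model_summary` and the five-point data set are hypothetical and are not the household income data.

```python
import math


def model_summary(observed, predicted, n_predictors):
    """Return R, R Square, Adjusted R Square and the Std. Error of the Estimate."""
    n = len(observed)
    mean_y = sum(observed) / n
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    df_residual = n - n_predictors - 1
    r_square = 1 - ss_residual / ss_total
    return {
        "R": math.sqrt(r_square),
        "R Square": r_square,
        # The adjustment penalises R Square for the number of predictors used.
        "Adjusted R Square": 1 - (1 - r_square) * (n - 1) / df_residual,
        # Standard deviation of the residuals.
        "Std. Error of the Estimate": math.sqrt(ss_residual / df_residual),
    }


# Hypothetical example: regressing y on x = [1, 1, 2, 3, 3] gives fitted values 3 * x.
observed = [2, 4, 6, 8, 10]
predicted = [3, 3, 6, 9, 9]
summary = model_summary(observed, predicted, n_predictors=1)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")

# Consistency check on Table 1: squaring R should reproduce R Square.
table_r = 0.799
print(round(table_r ** 2, 3))
```

The last line squares the R value reported in Table 1 and gives 0.638, the R Square shown in the same table.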

At a 95% confidence level (the default), the ANOVA table (Table 2 below) provides the F-test for the null hypothesis that none of the independent variables is related to household income. With F = 321.34 and p = 0.001 (p < 0.01), we reject the null hypothesis and conclude that years with current employer, age and the number of people in the household, taken together, are significantly related to household income. This test evaluates all 3 variables together. If the result is not “significant” at this step, we do not proceed to the next step, i.e. the t-tests on the individual coefficients.

Table 2: ANOVA (Analysis of Variance)

| Model | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 430.337 | 3 | 143.446 | 321.34 | .001 |
| Residual | .000 | 251 | .000 | | |
| Total | 430.337 | 254 | | | |

### Legends

- **Sum of Squares:** The variation associated with each of the three sources of variance: Regression, Residual and Total. It is usually not reported when presenting results.
- **df:** The degrees of freedom associated with each source of variance. The total df is N-1, i.e. the number of respondents in the sample minus 1. The regression df equals the number of independent variables, and the residual df is the difference between the two.
- **Mean Square:** The Sum of Squares divided by its respective df.
- **F:** Mean Square Regression divided by Mean Square Residual, here 321.34. As a rough guide this value should be above 3.95, although the exact critical value depends on the degrees of freedom. It complements the Sig. value: the larger the F value, the smaller the Sig. value.
- **Sig.:** The p-value of the F-test, which reflects the significance of the regression model. A value below 0.05 means the model is significant at the 95% confidence level, and a value below 0.01 means it is significant at the 99% confidence level.
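The arithmetic behind the ANOVA columns can be sketched in a few lines of Python. The function name `anova_table` and the two sums of squares passed to it are hypothetical. Only the sample size (255 respondents, implied by the total df of 254) and the number of predictors (3) are taken from Table 2.

```python
def anova_table(ss_regression, ss_residual, n_cases, n_predictors):
    """Degrees of freedom, mean squares and F for a regression ANOVA."""
    df_regression = n_predictors
    df_residual = n_cases - n_predictors - 1
    df_total = n_cases - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df": (df_regression, df_residual, df_total),
        "Mean Square": (ms_regression, ms_residual),
        "F": ms_regression / ms_residual,
    }


# Hypothetical sums of squares; sample size and predictor count as in Table 2.
table = anova_table(ss_regression=300.0, ss_residual=502.0, n_cases=255, n_predictors=3)
print(table["df"])           # degrees of freedom: regression, residual, total
print(table["Mean Square"])  # mean squares: regression, residual
print(table["F"])

# Check on Table 2: Regression Sum of Squares divided by its df.
table2_mean_square = round(430.337 / 3, 3)
print(table2_mean_square)
```

The degrees of freedom come out as 3, 251 and 254, matching the df column of Table 2, and the final check reproduces the Regression Mean Square of 143.446.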

The earlier table showed that the 3 variables, taken together, have a significant relation with “household income”. In the next step, we determine the relationship between each independent variable and “household income” individually.

The output shown in Table 3 (Coefficients) provides the estimated regression coefficients, their standard errors, t-tests and significance values. The estimated regression coefficients appear under “Unstandardized Coefficients B”. Each one predicts the change in the dependent variable (household income) when that independent variable (Age, Number of people in household or Years with current employer) increases by one unit while all other variables in the model are held constant.

To test the null hypothesis for an individual variable, refer to its t-statistic and the corresponding “Sig.” value. For Age, the Sig. value of 0.456 indicates that age has no statistically significant effect on household income (at a 95% confidence level, the Sig. value has to be less than 0.05 for the relationship to be considered “significant”).

Table 3: Coefficients

| Model | | Unstandardized Coefficients B | Unstandardized Coefficients Std. Error | Standardized Coefficients Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 11.306 | 7.315 | | 1.546 | .131 |
| | Age | .464 | .130 | 3.564 | .439 | .456 |
| | Number of People in Household | .156 | .205 | .754 | .082 | .001 |
| | Years with Current Employer | 21.071 | 4.561 | 4.315 | .487 | .000 |

### Legends

- **Unstandardized Coefficients (B):** The values for the regression equation that predicts the dependent variable from the independent variables. In simpler terms, each B is the change in the dependent variable for a one-unit change in the corresponding predictor, i.e. the independent variable.
- **Std. Error:** The standard errors associated with the coefficients.
- **Standardized Coefficients (Beta):** The coefficients that would be obtained if the variables were standardized prior to analysis. Standardization puts all predictors (independent variables) on the same scale, so their relative effects can be compared.
- **t:** The coefficient divided by its standard error. Together with the Sig. value, it is used to decide whether to reject the null hypothesis. The two values complement each other: a larger absolute t-value goes with a lower Sig. value.
- **Sig.:** The p-value of the t-test. A value below 0.05 is significant at the 95% confidence level, and a value below 0.01 is significant at the 99% confidence level.
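To make the legend concrete, the sketch below plugs the B values from Table 3 into a regression equation, checks one t-value as B divided by its standard error, and applies the 0.05 rule to two Sig. values. The respondent (age 40, 3 people in the household, 5 years with the current employer) and the function names `predict_income` and `is_significant` are hypothetical.

```python
# Unstandardized coefficients (B) copied from Table 3.
CONSTANT = 11.306
B = {"age": 0.464, "people_in_household": 0.156, "years_with_employer": 21.071}


def predict_income(age, people_in_household, years_with_employer):
    """Predicted household income (in thousands) from the regression equation."""
    return (CONSTANT
            + B["age"] * age
            + B["people_in_household"] * people_in_household
            + B["years_with_employer"] * years_with_employer)


def is_significant(sig, alpha=0.05):
    """Apply the rule that Sig. must be below alpha (0.05 at 95% confidence)."""
    return sig < alpha


# Hypothetical respondent, and the same respondent one year older.
base = predict_income(40, 3, 5)
older = predict_income(41, 3, 5)
print(round(base, 3))
print(round(older - base, 3))  # a one-unit change in age shifts the prediction by B for age

# t for the constant: B divided by its standard error (both from Table 3).
t_constant = 11.306 / 7.315
print(round(t_constant, 3))

print(is_significant(0.456))  # Sig. reported for Age
print(is_significant(0.001))  # Sig. reported for Number of People in Household
```

The t-value computed for the constant rounds to 1.546, the figure shown in Table 3, and the Sig. check flags Age as not significant at the 95% level.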

In the above demonstrations, we concluded two things:

- The three independent variables together explain about 64% of the variability in household income (R Square = .638 in Table 1).
- The relationship between Age (independent variable) and Household income (dependent variable) is “not significant” (Sig. = 0.456).

The method used above is called “Enter” regression, in which all the independent variables are entered into the model together. Apart from this, there are 3 other major methods of regression, which are used less often: Forward, Backward and Stepwise regression.

### Forward Selection

This method starts with a model containing none of the independent variables. In the first step, the procedure considers the variables one by one for inclusion and selects the one that results in the largest increase in R² (explained variability). In the second step, the procedure considers the remaining variables for inclusion in a model that already contains the variable selected in the first step. At each step, the variable giving the largest increase in R² is added until, according to an F-test, further additions are judged not to improve the model.
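The forward procedure can be sketched in plain Python as follows. This is an illustration of the idea, not SPSS's implementation. The entry rule used here (a partial F for the change in R² of at least 3.84) is an assumed cut-off, and the eight rows of data are constructed so that two predictors matter and one does not. All function and variable names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R² of a least-squares fit of y on the given columns plus a constant."""
    n = len(y)
    x = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in x]
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_residual / ss_total


def forward_select(candidates, y, f_to_enter=3.84):
    """Add predictors one at a time while the partial F stays above f_to_enter."""
    selected, current_r2 = [], 0.0
    while len(selected) < len(candidates):
        remaining = [name for name in candidates if name not in selected]
        # R² of each trial model: variables already selected plus one candidate.
        trial_r2 = {
            name: r_squared([candidates[s] for s in selected + [name]], y)
            for name in remaining
        }
        best = max(trial_r2, key=trial_r2.get)
        df_residual = len(y) - (len(selected) + 1) - 1
        if df_residual <= 0:
            break
        unexplained = 1 - trial_r2[best]
        if unexplained <= 1e-12:
            f_change = float("inf")  # perfect fit: treat the gain as decisive
        else:
            f_change = (trial_r2[best] - current_r2) / (unexplained / df_residual)
        if f_change < f_to_enter:
            break  # the best remaining variable does not improve the model enough
        selected.append(best)
        current_r2 = trial_r2[best]
    return selected


# Hypothetical data: income depends strongly on years, moderately on age,
# and only negligibly on the number of people in the household.
candidates = {
    "years_with_employer": [1, 3, 5, 7, 9, 11, 13, 15],
    "age": [45, 45, 35, 35, 35, 35, 45, 45],
    "people_in_household": [3, 1, 1, 3, 3, 1, 1, 3],
}
noise = [1, -1, 1, -1, -1, 1, -1, 1]
income = [
    20 + 2 * yr + 0.3 * ag + 0.1 * pp + e
    for yr, ag, pp, e in zip(candidates["years_with_employer"], candidates["age"],
                             candidates["people_in_household"], noise)
]

chosen = forward_select(candidates, income)
print(chosen)
```

On this constructed data the sketch enters years with current employer first, then age, and stops before adding the number of people in the household, because that variable's contribution to R² is too small to pass the assumed F threshold.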

### Backward Selection

This method starts with a model containing all the variables and eliminates them one by one, at each step excluding the variable whose removal leads to the smallest decrease in R². The procedure is repeated until, according to an F-test, further exclusions would represent a deterioration of the model.

### Stepwise Selection

When a study has a large number of independent variables and you want a regression model that includes only the variables that are statistically related to the dependent variable, choose the “Stepwise” method from the drop-down list. With “Stepwise”, only variables that meet the criteria set in the Linear Regression Options dialogue box will enter the equation.

Priya is the co-founder and Managing Partner of Project Guru, a research and analytics firm based in Gurgaon. She is responsible for the human resource planning and operations functions. Her expertise in analytics has been used in a number of service-based industries like education and financial services.

