# Linear regression analysis using SPSS

By Divya Narang and Priya Chetty on January 9, 2015

Linear regression analysis is used to determine the relationship between a dependent variable and one or more independent variables. The procedure shows whether the independent variables influence the dependent variable, and to what extent. For example, a waiter’s income (the dependent variable) depends on a number of independent variables such as age, total bill amount and tip.

## Linear regression with SPSS

• Step 1: From the menu, choose Analyze -> Regression -> Linear, as shown in Figure 1 below.
• Step 2: This opens the Linear Regression dialog box (Figure 2). Select Household Income in thousands and move it to the Dependent box. Next, select the independent variables (Age, Number of people in household and Years with current employer) and move them to the Independent(s) box. Click OK to run the analysis.

Inferences: Table 1 below shows the Model Summary for the present analysis. The model fit output consists of the “Model Summary” table and the ANOVA table (Table 2). The Model Summary includes the multiple correlation coefficient R, its square R², and the adjusted version of this coefficient as summary measures of model fit. As can be seen, the multiple correlation coefficient R = 0.799 indicates a strong correlation between the observed household income and the values predicted from the independent variables (the closer the figure is to 1.000, the stronger the correlation). In terms of variability, R² = 0.634 means that the model explains 63.4% of the variance in household income in the sample. Adjusted R², which corrects for the number of independent variables in the model, gives a revised estimate that 60.8% of the variability in household income in the sample is explained by the three independent variables (i.e. Years with Current Employer, Age, and Number of people in Household).

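Further, the Standard Error of the Estimate reported in Table 1 is 12.021. It is the estimated standard deviation of the residuals (not the mean absolute deviation) and is expressed in the units of the dependent variable. It is small considering that the average household income ranges from 5000 to 25000 Rs.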

### Legends

• R: Multiple correlation coefficient that tells us how strongly the multiple independent variables are related to the dependent variable
• R Square: Indicates how much of the total variation in the dependent variable is due to the independent variables
• Adjusted R Square: R Square adjusted for the number of independent variables in the model and the sample size; it is never higher than R Square.
• Std. Error of the Estimate: It represents the standard deviation of the error term
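The quantities in the Model Summary all follow from the sums of squares of the fitted model. The sketch below shows the standard formulas. The function names and the toy numbers are illustrative assumptions and are not taken from Table 1.

```python
import math


def r_squared(ss_regression, ss_total):
    """Share of the total variation in the dependent variable explained by the model."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, n, k):
    """R Square corrected for the number of predictors k, given n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def std_error_of_estimate(ss_residual, n, k):
    """Estimated standard deviation of the residuals (error term)."""
    return math.sqrt(ss_residual / (n - k - 1))


# Hypothetical toy values, NOT the figures from the article's tables.
n, k = 24, 3                       # 24 cases, 3 independent variables
ss_total, ss_regression = 1000.0, 600.0
ss_residual = ss_total - ss_regression

r2 = r_squared(ss_regression, ss_total)
print("R Square:", round(r2, 3))
print("R:", round(math.sqrt(r2), 3))
print("Adjusted R Square:", round(adjusted_r_squared(r2, n, k), 3))
print("Std. Error of the Estimate:", round(std_error_of_estimate(ss_residual, n, k), 3))
```

Because the correction factor (n - 1) / (n - k - 1) is at least 1, the adjusted value can only stay the same or drop as more predictors are added for a given R Square.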

At a significance level of 0.05 (i.e. a 95% confidence level), the ANOVA table (Table 2 below) provides an F-test of the null hypothesis that none of the independent variables is related to household income. Based on the analysis, we can reject the null hypothesis (F = 321.34, p = 0.001, which is below 0.01) and conclude that years with current employer, age and number of people in household, taken together, have a significant relationship with household income. The F-test evaluates all 3 variables jointly. If the result is not “significant” in this step, we do not proceed to the next step, i.e. the t-tests of the individual coefficients.

### Legends

• Sum of Squares: It is associated with the three sources of variance: Regression, Residual and Total. This measure is usually not reported when presenting results.
• df: The degrees of freedom associated with each source of variance. The Regression df equals the number of independent variables (k), the Residual df equals N - k - 1, and the Total df equals N - 1, where N is the number of respondents in the sample.
• Mean Square: The Sum of Squares divided by its respective df.
• F: The value obtained by dividing the Regression Mean Square by the Residual Mean Square, which here gives 321.34. It is compared with a critical value that depends on the two df values and the chosen significance level, and it complements the Sig. value.
• Sig.: The p-value of the F-test, which reflects the significance of the regression model. A value below 0.05 means the model is significant at the 95% confidence level, and a value below 0.01 means it is significant at the 99% confidence level.
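The arithmetic behind the ANOVA table can be sketched in a few lines. The function names and input numbers below are hypothetical and are not the values in Table 2. The Sig. value itself comes from the F distribution, which this sketch does not compute.

```python
def anova_table(ss_regression, ss_residual, n, k):
    """Degrees of freedom, mean squares and F for n cases and k predictors."""
    df_regression = k
    df_residual = n - k - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df_regression": df_regression,
        "df_residual": df_residual,
        "df_total": n - 1,
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f": ms_regression / ms_residual,
    }


def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis when the Sig. value is below alpha."""
    return p_value < alpha


# Hypothetical toy values, NOT the figures from the article's tables.
table = anova_table(ss_regression=600.0, ss_residual=400.0, n=24, k=3)
print(table)
print(is_significant(0.001, alpha=0.01))  # a Sig. of 0.001 is below 0.01
print(is_significant(0.456))              # a Sig. of 0.456 is not below 0.05
```

Note that the regression and residual df always add up to the total df, N - 1.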

The earlier table revealed that all 3 variables, taken together, have a significant relationship with “household income”. In the next step, we determine the relationship between each individual independent variable and “household income”.

The Coefficients output shown in Table 3 provides the estimated regression coefficients, their standard errors, t-tests and significance values. The estimated regression coefficients appear under “Unstandardized Coefficients B”. Each one predicts the change in the dependent variable (i.e. household income) when the corresponding independent variable (Age, Number of people in household or Years with current employer) is increased by one unit, with all the other variables in the model held constant.
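Written out for the three predictors in this example, the fitted regression equation has the general form below. The symbols are notation assumed here for illustration: $b_0$ is the constant, and $b_1$, $b_2$, $b_3$ are the unstandardized coefficients B.

$$
\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3
$$

Here $\hat{Y}$ is the predicted household income (in thousands), and $X_1$, $X_2$, $X_3$ stand for Age, Number of people in household and Years with current employer. Increasing $X_1$ by one unit while $X_2$ and $X_3$ stay fixed changes $\hat{Y}$ by $b_1$.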

To test the null hypothesis that an individual coefficient is zero, refer to the t-statistic and its “Sig.” value. For Age, the significance value (0.456) is greater than 0.05, so we fail to reject the null hypothesis: Age has no statistically significant effect on household income once the other independent variables are taken into account (at the 95% confidence level, the Sig. value has to be less than 0.05 for a relationship to be considered “significant”).

### Legends

• Unstandardized Coefficients (B): The values for the regression equation that predicts the dependent variable from the independent variables. In simpler terms, each value reflects the change in the dependent variable for a one-unit change in that predictor (i.e. the independent variable), with the other predictors held constant.
• Std. Error: These are the standard errors associated with the coefficients.
• Standardized Coefficients (Beta): The coefficients that would be obtained if all variables were standardized prior to analysis. By standardization we mean that each variable is rescaled to a mean of 0 and a standard deviation of 1, so that the coefficients of predictors measured in different units can be compared.
• t: The coefficient divided by its standard error. This value, along with the Sig. value, is used to decide whether to reject the null hypothesis. Since the two values complement each other, a lower Sig. value corresponds to a larger absolute t-value.
• Sig.: The p-value of the t-test. As indicated above, a value below 0.05 is significant at the 95% confidence level, and a value below 0.01 is significant at the 99% confidence level.
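Two of the columns in the Coefficients table are simple transformations of B. The sketch below shows the standard relationships. The function names and all input numbers are hypothetical and are not read from Table 3.

```python
def t_statistic(b, std_error):
    """t value for the null hypothesis that the coefficient is zero."""
    return b / std_error


def standardized_beta(b, sd_predictor, sd_dependent):
    """Convert an unstandardized coefficient B into a standardized Beta."""
    return b * sd_predictor / sd_dependent


# Hypothetical toy values, NOT the figures from the article's tables.
b, se = 0.5, 0.25
print(t_statistic(b, se))

# Assumed sample standard deviations of the predictor and the dependent variable.
print(standardized_beta(b, sd_predictor=4.0, sd_dependent=10.0))
```

A coefficient that is small relative to its standard error gives a t value close to zero and therefore a high Sig. value.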

In the above demonstrations, we concluded two things:

1. The three independent variables together explain about 63% of the variability in the dependent variable.
2. The relationship between Age (independent variable) and Household income (dependent variable) is “not significant” (Sig. = 0.456).

The method of regression used above is called “Enter”: all independent variables are entered into the model in a single step. Apart from this, there are 3 other major methods, which are used less often: Forward, Backward and Stepwise regression.

### Forward Selection

This method starts with a model containing none of the independent variables. In the first step, the procedure considers the variables one by one for inclusion and selects the variable that results in the largest increase in R² (the explained variability). In the second step, the procedure considers the remaining variables for inclusion in a model that contains only the variable selected in the first step. In each step, the variable with the largest increase in R² is selected until, according to an F-test, further additions are judged not to improve the model.
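The forward procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SPSS's implementation: the helper names, the coded toy data (low = -1, high = +1) and the fixed F-to-enter threshold of 3.84 are all assumptions made for the example. A perfectly collinear set of predictors is not handled.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residual_ss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(predictors, y, f_to_enter=3.84):
    """Add one predictor per step while its partial F is at least f_to_enter (assumed threshold)."""
    selected, remaining = [], list(predictors)
    current_rss = residual_ss([], y)  # intercept-only model
    n = len(y)
    while remaining:
        # The largest drop in residual SS is the largest increase in R squared.
        def rss_with(name):
            return residual_ss([predictors[v] for v in selected + [name]], y)
        best = min(remaining, key=rss_with)
        new_rss = rss_with(best)
        df_residual = n - (len(selected) + 1) - 1
        if df_residual <= 0:
            break
        if new_rss == 0:
            partial_f = float("inf")
        else:
            partial_f = (current_rss - new_rss) / (new_rss / df_residual)
        if partial_f < f_to_enter:
            break  # the best remaining candidate does not improve the model enough
        selected.append(best)
        remaining.remove(best)
        current_rss = new_rss
    return selected


# Hypothetical coded toy data (8 cases), NOT the article's data set.
predictors = {
    "age":            [-1, 1, -1, 1, -1, 1, -1, 1],
    "household_size": [-1, -1, 1, 1, -1, -1, 1, 1],
    "years_employed": [-1, -1, -1, -1, 1, 1, 1, 1],
}
y = [2, 4, 8, 6, 14, 12, 16, 18]

print(forward_selection(predictors, y))
```

With this toy data, years_employed enters first, household_size enters second, and age is left out because adding it does not reduce the residual sum of squares.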

### Backward Selection

This method starts with a model containing all the variables and eliminates variables one by one, at each step choosing the variable for exclusion as that leading to the smallest decrease in R². Again, the procedure is repeated until, according to an F-test, further exclusions would represent a deterioration of the model.
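The F-test used at each step compares the model before and after a variable is dropped. One common way to write it is shown below. The notation is assumed here for illustration: $SS_{\text{res}}$ is the residual sum of squares, $n$ the number of cases, $k$ the number of independent variables in the larger model, and $q$ the number of variables removed ($q = 1$ for a single step).

$$
F = \frac{\left(SS_{\text{res, reduced}} - SS_{\text{res, full}}\right) / q}{SS_{\text{res, full}} / (n - k - 1)}
$$

A small $F$ means that removing the variable barely increases the residual sum of squares, so the variable can be dropped. A large $F$ means the exclusion would noticeably worsen the fit.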

### Stepwise Selection

In a study with a large number of independent variables, where you want to develop a regression model that includes only the variables that are statistically related to the dependent variable, you can choose the “Stepwise” method from the Method drop-down list. If you choose “Stepwise”, only variables that meet the entry and removal criteria set in the Linear Regression Options dialog box will enter the equation, and a variable already in the model can be removed again if it no longer meets the criteria.