Why is it important to test heteroskedasticity in a dataset?

By Riya Jain & Priya Chetty on March 23, 2020

Heteroskedasticity refers to systematic changes in the spread of the residuals, or error term, of a model. Its presence indicates that the scatter of the model's errors depends on at least one independent variable. This introduces bias into the model and causes its results to deviate from the true, efficient estimates.

For example, suppose a study aims to identify the factors that lead to emotional exhaustion in an organization. Job control, work pressure, and concentration requirements are some of the main factors affecting the emotional state of an employee; thus, they are the independent variables. Here, job control plays the dominant role, and its effect buffers the effect of work pressure on emotional exhaustion. Because the spread of the model's errors is driven largely by a single variable, i.e. job control, heteroskedasticity is present in the model.
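The cone-shaped spread that defines heteroskedasticity is easy to reproduce in a short simulation. The sketch below (in Python, with hypothetical variable names echoing the example above; it is an illustration, not the study's data) generates data whose error variance grows with the predictor, so the residuals fan out instead of staying evenly spread.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

job_control = rng.uniform(1, 10, n)             # hypothetical predictor
# The error standard deviation grows with job_control, producing the
# cone-shaped residual spread characteristic of heteroskedasticity.
errors = rng.normal(0, 0.5 * job_control)
emotional_exhaustion = 2.0 + 1.5 * job_control + errors

# Compare the residual spread at low vs high values of the predictor.
print(f"residual SD, job control < 5:  {errors[job_control < 5].std():.2f}")
print(f"residual SD, job control >= 5: {errors[job_control >= 5].std():.2f}")
```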

Figure 1: Example of heteroskedasticity

Why is testing for heteroskedasticity essential?

Heteroskedasticity can be present in a model due to any of the following reasons:

  • The existence of outliers in the dataset.
  • The collection of data from variables measured on different scales.
  • Incorrect specification of the model.
  • The use of an incorrect transformation to represent the model.

Each of the above cases can push the results away from the efficient outcome. Thus, the presence of heteroskedasticity in the model violates an assumption of ordinary least squares (OLS) regression and tends to produce biased results. Moreover, it renders the results of t-tests and F-tests unreliable.
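A small simulation can show why the t and F results become unreliable. In the hypothetical setup below, the default OLS standard error (which assumes equal error variance) differs noticeably from a heteroskedasticity-robust (HC3) standard error computed on the same fit, so inference based on the default value would be misleading. This is a sketch using Python's statsmodels, not the study's data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
# Error standard deviation grows with x, so the equal-variance assumption fails.
y = 1.0 + 2.0 * x + rng.normal(0, x)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"default (equal-variance) SE of slope: {fit.bse[1]:.4f}")
print(f"heteroskedasticity-robust HC3 SE:     {fit.HC3_se[1]:.4f}")
```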

What are the different tests for examining the presence of heteroskedasticity in the model?

Heteroskedasticity in a model can be detected through two different forms of testing, i.e. graphical (visual) and statistical. The specifications of each test are shown in the table below.

| Test | Data assumption | Benefits | Limitations | Heteroskedasticity condition | Null hypothesis |
|---|---|---|---|---|---|
| Scatter plot of residuals | Works for all data | Helps in detecting the presence of heteroskedasticity | Gives only vague information about the presence of heteroskedasticity | Rough cone-shaped spread of the residuals | N/A |
| Bartlett test | The dataset should be normally distributed | Provides reliable and accurate results | Results are sensitive to normality; suitable only when the normality of the data is confirmed | Chi-square statistic exceeds the critical value | Variances are equal for all samples |
| Levene's test | The dataset should have only a minor deviation from normality | Can be applied to non-normal distributions; derives accurate results; built into most statistical software | Accurate results cannot be derived for datasets with a major deviation from normality | W statistic exceeds the critical value | Variances are equal for all samples |
| Brown-Forsythe test | The dataset can be non-normally distributed | Suitable for non-normal distributions | Results are less accurate | ANOVA F or W statistic exceeds the critical value | Variances are equal for all samples |
| Breusch-Pagan-Godfrey (Breusch-Pagan)/Cook-Weisberg test | The error variance should have a linear relationship with the independent variables; the dataset should be normally distributed | Derives accurate results; suitable for large sample sizes | Not applicable to non-linear relationships; cannot detect heteroskedasticity in non-normal distributions | Chi-square statistic exceeds the critical value | Variances are equal for all samples |
| White test | Applicable when the model contains a non-linear relationship | Suitable for non-linear relationships | Accurate results may not be derived for large samples; faulty results are possible | Chi-square statistic exceeds the critical value | Variances are equal for all samples |
| Goldfeld-Quandt test | The data should be normally distributed | Helps in detecting the presence of heteroskedasticity | Not suitable for non-normally distributed datasets; results cannot be generalized; influenced by the criteria chosen for separating the groups; offers no option to adjust the model to remove heteroskedasticity | F statistic exceeds the critical value | Variances are equal for all samples |
| Park test | The independent variable influencing the spread of the residuals must be identified | Helps in detecting the presence of heteroskedasticity | Assumptions are made about the functional form of the model; the contribution of unidentified independent variables cannot be determined; the statistic cannot be computed directly in software | t statistic exceeds the critical value | Variances are equal for all samples |
| Glejser test | The dataset should be symmetrically distributed | Helps in detecting the presence of heteroskedasticity | Not suitable for asymmetrically distributed datasets | t statistic exceeds the critical value | Variances are equal for all samples |
| Hartley's (Fmax) test | The dataset should be normally distributed; the sample drawn from each population should be of the same size | Simple to compute; helps in detecting the presence of heteroskedasticity | Results are less reliable; does not work effectively on non-normally distributed datasets | F statistic exceeds the critical value | Variances are equal for all samples |
| F-test | The dataset should be normally distributed; the samples must consist of independent observations | Helps in detecting the presence of heteroskedasticity | Does not work effectively on non-normally distributed datasets | F statistic exceeds the critical value | Variances are equal for all samples |

Table 1: Tests for detecting heteroskedasticity

Among all these tests, the scatter plot, Bartlett, Levene's, Breusch-Pagan, Cook-Weisberg, and White tests are the most commonly used heteroskedasticity tests. SPSS, Stata, and R support these tests (except the Bartlett test, which is not available in SPSS). However, in the case of regression analysis in SPSS, the scatter plot and the F-test are the most commonly used methods for testing heteroskedasticity.
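For readers working outside SPSS, the sketch below shows how the most commonly used tests from Table 1 can be run programmatically, under an assumed toolchain of Python's scipy and statsmodels and on simulated data.

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(0, x)          # heteroskedastic by construction

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Bartlett's and Levene's tests compare residual variances across groups
# (here, a simple low-x vs high-x split).
low, high = resid[x < 5], resid[x >= 5]
print("Bartlett:", stats.bartlett(low, high))
print("Levene:  ", stats.levene(low, high))

# Breusch-Pagan and White regress functions of the squared residuals on the
# regressors; a small p-value indicates heteroskedasticity.
bp_lm, bp_p, _, _ = het_breuschpagan(resid, X)
w_lm, w_p, _, _ = het_white(resid, X)
print(f"Breusch-Pagan: LM = {bp_lm:.1f}, p = {bp_p:.4g}")
print(f"White:         LM = {w_lm:.1f}, p = {w_p:.4g}")
```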

Case example

In order to assess the presence of heteroskedasticity in the model stating the impact of job control, work pressure, and concentration requirements on the level of emotional exhaustion, the normality of the dataset is tested first. Both the normality and heteroskedasticity testing are done using SPSS.

Statistical tests of normality are discussed here. The results of the Shapiro-Wilk test are shown below.

| Variable | Statistic | Sig. |
|---|---|---|
| Emotional exhaustion | 0.791958 | 0.003976 |
| Job control | 0.792225 | 0.004006 |
| Work pressure | 0.832540 | 0.012978 |
| Concentration requirements | 0.791468 | 0.003922 |

Table 2: Normality results (SPSS results)

The above table shows that the significance value for each of the variables is less than the significance level of 0.05. The null hypothesis of a normally distributed dataset is therefore rejected, which shows that the variables are not normally distributed. The scatter plot of residuals, which works for all data (see Table 1), is therefore examined first.
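The same Shapiro-Wilk check can be reproduced programmatically. The sketch below uses scipy's shapiro function on hypothetical data arrays standing in for the study variables; the decision rule mirrors the one applied to Table 2.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
variables = {
    "Emotional exhaustion": rng.exponential(2.0, 40),  # skewed: should fail the test
    "Job control": rng.normal(5.0, 1.0, 40),           # normal: should pass
}

for name, values in variables.items():
    stat, p = shapiro(values)
    verdict = "reject normality" if p < 0.05 else "normality not rejected"
    print(f"{name}: W = {stat:.3f}, p = {p:.4f} -> {verdict}")
```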

Heteroskedasticity test via scatter plot of residuals

Linear regression analysis was performed for the variables. Next, the scatter plot of the residuals was generated by following the steps below.

Step 1: Select Analyze > Regression > Linear. The dialog box shown below will appear.

Figure 2: Linear regression analysis in SPSS

Step 2: Assign the variables to the dependent and independent fields, then click on 'Plots'. The dialog box shown below will appear.

Figure 3: Categorisation of variables

Step 3: Assign ZPRED as the 'X' variable and ZRESID as the 'Y' variable. Under the standardized residual plots, select 'Normal probability plot' and then click on 'Continue'.

Figure 4: Residual plot

Step 4: Click on 'OK'. The scatterplot shown below will be generated as output.

Figure 5: Residual scatterplot

The above figure shows that the residuals are scattered without a consistent pattern. Although most values are concentrated close to 0 and 1, the spread of the residuals is not uniform. Hence, heteroskedasticity is present in the model.
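The equivalent diagnostic can be drawn outside SPSS as well. The sketch below, assuming a fitted OLS model on simulated data, plots standardized predicted values (ZPRED) against standardized residuals (ZRESID), the same pairing configured in Step 3; a fan or cone shape signals heteroskedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, x)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Standardize fitted values and residuals, mirroring SPSS's ZPRED and ZRESID.
zpred = (fit.fittedvalues - fit.fittedvalues.mean()) / fit.fittedvalues.std()
zresid = fit.resid / fit.resid.std()

plt.scatter(zpred, zresid, s=12)
plt.axhline(0, linestyle="--", linewidth=1)
plt.xlabel("Regression standardized predicted value (ZPRED)")
plt.ylabel("Regression standardized residual (ZRESID)")
plt.title("Residual scatterplot")
plt.show()
```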

Heteroskedasticity test via F-test

The procedure for regression analysis in SPSS is discussed here. The results from the ANOVA table are shown below.

| F-value | Sig. |
|---|---|
| 24.325 | 0.000 |

Table 3: F-test results

The above table shows that the significance value for the F-test is 0.000, which is less than the significance level of the study, i.e. 0.05. Thus, the null hypothesis of equal variance is rejected. Hence, heteroskedasticity is present in the model.
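For reference, the equal-variance F-test listed in Table 1 can also be computed by hand as a ratio of residual variances from two independent subsamples; under the null hypothesis of equal variance, this ratio follows an F distribution. The sketch below uses simulated residuals and an assumed split point, so it illustrates the mechanics rather than the study's actual data.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
resid = rng.normal(0, x)                      # heteroskedastic residuals

# Split the residuals into two independent subsamples; by construction the
# high-x half has the larger variance, which goes in the numerator.
low, high = resid[x < 5], resid[x >= 5]
F = high.var(ddof=1) / low.var(ddof=1)
p = 1.0 - f_dist.cdf(F, len(high) - 1, len(low) - 1)  # one-tailed p-value

print(f"F = {F:.2f}, p = {p:.4f}")            # p < 0.05: reject equal variances
```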

How to overcome the problem of heteroskedasticity?

In order to remove the problem of heteroskedasticity from a model, it is often recommended to follow any of the methods below:

  • Regression analysis with robust standard errors.
  • Generalized least squares (GLS) regression.
  • Weighted least squares (WLS) regression.

Apart from these methods, a log transformation of the dataset can also be applied, as it helps reduce the effect of errors in the model. By following any of the above methods, homoscedasticity can be achieved, and accurate, reliable results with minimum variability can be derived.
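The sketch below illustrates these remedies on the same simulated heteroskedastic data (an assumed setup, not the study's dataset), using Python's statsmodels: OLS with robust (HC3) standard errors, weighted least squares (a special case of GLS with a diagonal weight matrix), and OLS on the log-transformed outcome.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
# Multiplicative errors: the variance of y grows with x, so OLS assumptions fail.
y = np.exp(0.5 + 0.2 * x + rng.normal(0, 0.3, 300))

X = sm.add_constant(x)

# 1. OLS with heteroskedasticity-robust (HC3) standard errors.
robust = sm.OLS(y, X).fit(cov_type="HC3")

# 2. Weighted least squares: weight each observation by the inverse of its
#    assumed error variance (here taken as proportional to x squared).
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()

# 3. Log transformation: turns multiplicative errors into additive,
#    roughly constant-variance ones.
logged = sm.OLS(np.log(y), X).fit()

for name, model in [("robust OLS", robust), ("WLS", wls), ("log-OLS", logged)]:
    print(f"{name}: slope = {model.params[1]:.3f}, SE = {model.bse[1]:.3f}")
```

Note that the coefficients of the log model are on the log scale, so they are not directly comparable with those of the first two fits.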
