Why conduct a multicollinearity test in econometrics?

By Riya Jain & Priya Chetty on March 19, 2020

A multicollinearity test helps to diagnose the presence of multicollinearity in a model. Multicollinearity is a state in which two or more independent variables are correlated with one another. The presence of multicollinearity in a dataset is problematic for four reasons:

It inflates the variance of the estimated regression coefficients.
It makes the coefficient estimates extremely sensitive to minor changes in the data or the model.
It causes instability in the regression model.
It leads to skewed and unreliable results.
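
The loss of precision described in the list above can be made concrete with a standard result for ordinary least squares with an intercept. The notation here is an assumption introduced for illustration and does not come from the study: \hat{\beta}_j is the estimated coefficient of the j-th independent variable, \sigma^2 is the error variance, and R_j^2 is the R-squared obtained by regressing the j-th independent variable on all the other independent variables.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2}
```

As R_j^2 approaches 1, the second factor grows without bound, so the coefficient estimate becomes less precise. That second factor is the Variance Inflation Factor (VIF).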

For example, suppose a study aims to determine which factors influence customer loyalty. It identifies four possible factors: customer satisfaction, product quality, service quality, and brand awareness. These are the independent variables. However, the study also finds that customer satisfaction is correlated with product quality and service quality. This interlinkage between independent variables signals the presence of multicollinearity in the model.

When does multicollinearity arise?

The problem of multicollinearity arises mainly for two reasons:

Poorly collected or manipulated data; or
Structural problems such as including a variable computed from other independent variables, repeating similar variables, or using dummy variables incorrectly.

Different tests for examining the presence of multicollinearity

There are different ways to detect whether multicollinearity is present in a model. The table below lists the tests and explains when each one applies.

Test	Data Assumptions	Benefit	Disadvantage	How to perform	Multicollinearity Condition
Pearson Correlation	The number of independent variables should be less than 5.	Shows the interlinkage between each pair of independent variables.	Suitable only for multiple regression with few (<5) independent variables. Does not work efficiently on large sample sizes.	Check the Pearson correlation coefficient between each pair of independent variables.	Pearson coefficient between two independent variables is greater than 0.5.
Variance Inflation Factor (VIF)	Suitable for a large number of independent variables.	Helps identify the presence of multicollinearity.	Does not give adequate results for non-linear regression. Not suitable for cases with 3 or more categorical or dummy variables.	Run the regression of the dependent variable on the independent variables with the collinearity diagnostics option selected.	VIF is greater than 5.
P-value and Coefficient value Test (Regression Analysis)	Theory specifies a strong influence of the independent variables on the dependent variable.	Helps to validate the statistical results against theory. Works as an indicator of multicollinearity.	Does not give accurate results. Suitable only for a small sample size.	Run the regression of the dependent variable on the independent variables and check the p-value of each independent variable in the coefficient table.	Coefficient table: the p-value is greater than the significance level of the study, i.e. the independent variable is insignificant, or the sign of the coefficient differs from the sign stated in theory.
ANOVA (Regression Analysis)	The overall model is appropriate, but the independent variables do not yield significant results.	Works as an indicator of multicollinearity.	Does not give accurate results. Suitable for a small sample size only.	Run the regression of the dependent variable on the independent variables and compare the p-value of the F-statistic in the ANOVA table with the p-values of the independent variables in the coefficient table.	ANOVA and coefficient tables: the p-value of F is significant but the p-values of the independent variables are insignificant.

Table 1: Multicollinearity tests

Among all these tests, Pearson’s coefficient and VIF are the most widely used for examining the presence of multicollinearity. Software such as SPSS, Stata, and R can be used for the computation.

Continuing the example stated above, the presence of multicollinearity is examined in the model stating that customer loyalty is affected by customer satisfaction, product quality, service quality, and brand awareness. The analysis was done using SPSS software.

Multicollinearity test via Pearson’s correlation coefficient

The Pearson correlation coefficient was computed for every pair of independent variables. The correlation matrix is shown in the table below.

	Customer Satisfaction	Product Quality	Service Quality	Brand Awareness
Customer Satisfaction	1	.934**	.835**	.488**
Product Quality	.934**	1	.896**	.420**
Service Quality	.835**	.896**	1	.337*
Brand Awareness	.488**	.420**	.337*	1

Table 2: Correlation Matrix (SPSS results)

** Correlation is significant at the 0.01 level (2-tailed).

* Correlation is significant at the 0.05 level (2-tailed).

The above table shows that the coefficients linking customer satisfaction, product quality, and service quality (0.934, 0.835, and 0.896) are all greater than 0.5, whereas every coefficient involving brand awareness is below 0.5. Thus, multicollinearity is present in the model.
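
The same check can be scripted outside SPSS. The following is a minimal Python sketch that re-enters the coefficients from Table 2 by hand and lists every pair of independent variables whose coefficient exceeds the 0.5 cut-off used in this article. The names `corr`, `THRESHOLD`, and `flagged` are illustrative and not part of the original analysis.

```python
from itertools import combinations

# Independent variables, in the order used in Table 2.
names = [
    "Customer Satisfaction",
    "Product Quality",
    "Service Quality",
    "Brand Awareness",
]

# Pearson correlation matrix copied from Table 2.
corr = [
    [1.000, 0.934, 0.835, 0.488],
    [0.934, 1.000, 0.896, 0.420],
    [0.835, 0.896, 1.000, 0.337],
    [0.488, 0.420, 0.337, 1.000],
]

THRESHOLD = 0.5  # cut-off used in this article

# Collect each unordered pair whose absolute correlation exceeds the cut-off.
flagged = [
    (names[i], names[j], corr[i][j])
    for i, j in combinations(range(len(names)), 2)
    if abs(corr[i][j]) > THRESHOLD
]

for first, second, r in flagged:
    print(f"{first} ~ {second}: r = {r:.3f}")
```

Only the three pairs among customer satisfaction, product quality, and service quality are listed; no pair involving brand awareness crosses the cut-off.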

Multicollinearity test via Variance Inflation Factor (VIF)

Step 1: Import data in SPSS.

Step 2: Select Analyze>Regression>Linear

Figure 2: Step 2 of conducting multicollinearity test via VIF

The dialogue box shown below will appear.

Figure 3: Multicollinearity test via VIF in SPSS

Step 3: Select ‘Statistics’ and then click on ‘Collinearity Diagnostics’. Select ‘Continue’.

Figure 4: Collinearity test computation via VIF

Step 4: Categorize the variables into ‘Dependent’ and ‘Independent’ variables and then select ‘OK’.

The VIF and collinearity diagnostics table shown below will appear.

Variable	Tolerance	VIF
Customer Satisfaction	.116	8.599
Product Quality	.083	12.008
Service Quality	.196	5.099
Brand Awareness	.744	1.344

Table 3: VIF results from collinearity statistics

The above table shows that the VIF exceeds the threshold of 5 for customer satisfaction (8.599 > 5), product quality (12.008 > 5), and service quality (5.099 > 5), while it is below the threshold for brand awareness (1.344 < 5). Thus, multicollinearity is present in the model. The variables customer satisfaction, product quality, and service quality are inter-related.
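
Tolerance and VIF are reciprocals of each other (VIF = 1 / Tolerance), so the threshold check can be reproduced from the tolerance column alone. The sketch below is a small Python illustration using the values from Table 3. The names `tolerance`, `VIF_THRESHOLD`, and `flagged` are illustrative. Because the tolerances are rounded to three decimals, the recomputed VIFs may differ slightly from the SPSS output.

```python
# Tolerance values copied from Table 3.
tolerance = {
    "Customer Satisfaction": 0.116,
    "Product Quality": 0.083,
    "Service Quality": 0.196,
    "Brand Awareness": 0.744,
}

VIF_THRESHOLD = 5.0  # cut-off used in this article

# VIF is the reciprocal of tolerance.
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Variables whose VIF exceeds the cut-off.
flagged = [name for name, value in vif.items() if value > VIF_THRESHOLD]

for name, value in vif.items():
    status = "above" if value > VIF_THRESHOLD else "below"
    print(f"{name}: VIF = {value:.3f} ({status} {VIF_THRESHOLD})")
```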

Multicollinearity test via P-value

Regression analysis reveals the significance value for each independent variable in the model. The procedure of regression analysis is explained here. Results of the analysis are shown in the table below.

Customer Loyalty	Coefficient	Sig (p-value)
(Constant)	.349	.343
Customer Satisfaction	.364	.148
Product Quality	.108	.727
Service Quality	.126	.526
Brand Awareness	.281	.003

Table 4: P-value results

The above table shows that the signs of the coefficients match the theoretical linkage between the dependent variable (customer loyalty) and the independent variables (customer satisfaction, product quality, service quality, and brand awareness), i.e. a positive relationship. Despite this, the p-values of customer satisfaction (0.148), product quality (0.727), and service quality (0.526) are greater than 0.05, i.e. insignificant. Only brand awareness (0.003) is significant. Thus, multicollinearity might exist in the model.

Multicollinearity test via ANOVA

The regression analysis procedure is shown here. Results of the ANOVA are presented in the table below.

Customer Loyalty	Coefficient	Sig (p-value)	F-value	F-value Sig
(Constant)	.349	.343	27.328	0.00
Customer Satisfaction	.364	.148
Product Quality	.108	.727
Service Quality	.126	.526
Brand Awareness	.281	.003

Table 5: ANOVA results

The results in the above table show that the model is jointly significant, i.e. the significance of the F-value is 0.000, which is less than the study's significance level of 0.05. The F-value is also greater than 1 (27.328 > 1), indicating that including the independent variables has improved the prediction of customer loyalty. Thus, the overall model is appropriate, yet three of the four independent variables are individually insignificant (p > 0.05), so there is a possibility of multicollinearity in the model.
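
The pattern described above, a significant F-test alongside insignificant individual predictors, can be written as a simple rule. The Python sketch below applies it to the values in Table 5. The rule itself (flag the model when F is significant and at least one predictor is not) is an assumed heuristic for illustration; like the test itself, it only indicates possible multicollinearity.

```python
ALPHA = 0.05  # significance level of the study

# Significance of the overall F-test, as reported in Table 5.
f_significance = 0.000

# p-values of the independent variables from the coefficient table.
p_values = {
    "Customer Satisfaction": 0.148,
    "Product Quality": 0.727,
    "Service Quality": 0.526,
    "Brand Awareness": 0.003,
}

jointly_significant = f_significance < ALPHA
insignificant = [name for name, p in p_values.items() if p > ALPHA]

# Heuristic: the model as a whole is significant, but some predictors are not.
possible_multicollinearity = jointly_significant and len(insignificant) > 0

print("Jointly significant:", jointly_significant)
print("Individually insignificant:", ", ".join(insignificant))
print("Possible multicollinearity:", possible_multicollinearity)
```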

How to overcome the problem of Multicollinearity?

To remove the problem of multicollinearity, it is recommended to drop the highly correlated independent variable from the model. Alternatively, use a method designed for highly correlated independent variables, such as partial least squares regression or principal component analysis.

In the above example, the tests point to a high inter-relation involving product quality:

High Pearson correlation values (0.934 with customer satisfaction and 0.896 with service quality).
High VIF (12.008).
High p-value (0.727).

Therefore, to reduce multicollinearity, product quality is removed from the model. The new analysis would be performed using customer satisfaction, service quality, and brand awareness as independent variables.
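
The effect of dropping a collinear variable can be illustrated with a small simulation. The Python sketch below does not use the study's data, which is not available here. It generates synthetic variables (`satisfaction`, `quality`, `awareness`, all hypothetical) in which `quality` is nearly a copy of `satisfaction`, and computes the VIF before and after removing `quality`. The helper `vif` is an illustrative implementation based on the definition VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on the others.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        values.append(1.0 / (1.0 - r_squared))
    return np.array(values)


# Synthetic data for illustration only (not the study's data).
rng = np.random.default_rng(0)
n = 200
satisfaction = rng.normal(size=n)
quality = satisfaction + 0.3 * rng.normal(size=n)  # nearly duplicates satisfaction
awareness = rng.normal(size=n)                     # unrelated to the other two

vif_before = vif(np.column_stack([satisfaction, quality, awareness]))
vif_after = vif(np.column_stack([satisfaction, awareness]))

print("VIF with quality included:", np.round(vif_before, 2))
print("VIF with quality removed: ", np.round(vif_after, 2))
```

In this synthetic setting the VIFs of `satisfaction` and `quality` are well above 5 while both are in the model, and the remaining VIFs fall close to 1 once `quality` is removed.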