How to conduct generalized least squares test?

In statistics, generalized least squares (GLS) is one of the most popular methods for estimating the unknown coefficients of a linear regression model when the errors have unequal variances or are correlated with one another. The ordinary least squares (OLS) method estimates the parameters of a linear regression model by minimizing the sum of the squared differences between the observed responses in the dataset and those predicted by a linear function. The main advantage of using OLS regression for estimating parameters is that it is easy to use. However, OLS gives reliable results only if the data are complete, there are no major outliers, and its assumptions about the error term hold. In particular, the OLS regression model does not take into account unequal variance, or ‘heteroskedastic errors’. With heteroskedastic errors the OLS coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors are biased, so significance tests become unreliable.

Therefore, the generalized least squares test is crucial in tackling the problems of heteroskedasticity and correlated errors in data. When the covariance structure of the errors is known, it produces estimators that are ‘Best Linear Unbiased Estimates’ (BLUE). Under these conditions, the GLS estimator is unbiased, consistent, efficient and asymptotically normal.

Major assumptions for generalized least squares regression analysis

OLS assumes that the errors are independent and identically distributed. More specifically, it assumes that:

  • The error variances are homoscedastic
  • The errors are uncorrelated
  • The errors are normally distributed

GLS relaxes the first two assumptions: the errors may have unequal variances and may be correlated, provided that their covariance structure is known or can be estimated. When the OLS assumptions hold, the OLS estimators and the GLS estimators are the same. Thus, the difference between OLS and GLS lies in the assumptions about the error term of the model. The GLS estimator can be understood from the following perspectives:

  • A generalization of OLS
  • A transformation of the model equation into a new model whose errors are uncorrelated and have equal variances, that is, homoskedastic.
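
The transformation perspective can be checked numerically. The sketch below is only an illustration, not the case study's analysis: it uses synthetic data and assumes the error covariance matrix `omega` is known (both are hypothetical). It computes the GLS estimate from the closed form (X'Ω⁻¹X)⁻¹X'Ω⁻¹y and confirms that OLS on the transformed model gives the same coefficients.

```python
import numpy as np

def gls_closed_form(X, y, omega):
    """GLS estimate: solve (X' Omega^-1 X) beta = X' Omega^-1 y."""
    omega_inv_X = np.linalg.solve(omega, X)
    omega_inv_y = np.linalg.solve(omega, y)
    return np.linalg.solve(X.T @ omega_inv_X, X.T @ omega_inv_y)

def gls_by_whitening(X, y, omega):
    """Transform the model so its errors are uncorrelated with equal variance, then run OLS."""
    L = np.linalg.cholesky(omega)      # omega = L @ L.T
    X_star = np.linalg.solve(L, X)     # L^-1 X
    y_star = np.linalg.solve(L, y)     # L^-1 y
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])

# Hypothetical covariance: unequal variances plus correlation between nearby observations
sd = np.linspace(0.5, 2.0, n)
corr = 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
omega = corr * np.outer(sd, sd)

errors = np.linalg.cholesky(omega) @ rng.normal(size=n)
y = X @ np.array([1.0, 0.5, -0.3]) + errors

beta_closed = gls_closed_form(X, y, omega)
beta_white = gls_by_whitening(X, y, omega)
print(beta_closed)
print(np.allclose(beta_closed, beta_white))
```

In practice the covariance matrix is rarely known and has to be estimated, which is usually referred to as feasible GLS.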

Example of generalized least squares test

This section explains the process of applying GLS with the use of a case study. The sample dataset contains data of 30 students. The aim is to review the impact of self-efficiency and ability (independent variables) on achievement (dependent variable). For this case study, a simple linear regression is performed first, and the results are then compared with the generalized least squares test.

Step 1: Linear regression

Table 1: Simple linear regression of case study


Since the dependent variable is continuous in nature, it is important to confirm whether it follows a normal distribution. The residuals of the dependent variable (achievement) are approximately normally distributed, with skewness -0.18 and kurtosis 1.95. As Table 1 shows, linear regression was performed to check the relationship between achievement and the two independent variables, self-efficiency and ability. For self-efficiency, the parameter estimate was 0.003 with a p value of 0.989. For the other independent variable, ability, the parameter estimate was -0.047 with a p value of 0.823. This shows that neither independent variable is statistically significant, as both p values are greater than 0.05.

The interpretation of coefficients of the independent variables is as follows:

  • The independent variable ‘self-efficiency’ is positively related to the dependent variable ‘achievement’, whereas the other independent variable, ‘ability’, is negatively related to it.
  • The parameter estimates and p values suggest that the sample size may have been inadequate to demonstrate the true relationship.
  • Furthermore, for every one-unit rise in self-efficiency, the dependent variable increases by 0.003 units, keeping all other factors the same.
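
Step 1 can be sketched in code. The case-study dataset is not reproduced here, so the example below generates hypothetical data for 30 students; the values and the data-generating coefficients are assumptions, and the output will not match Table 1. It fits OLS with NumPy and reports each coefficient with its standard error and t statistic. A p value would follow from the t distribution with n - k degrees of freedom.

```python
import numpy as np

def ols_summary(X, y):
    """Return OLS coefficients, standard errors and t statistics."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # unbiased estimate of the error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se, beta / se

# Hypothetical data standing in for the 30 students of the case study
rng = np.random.default_rng(1)
n = 30
self_efficiency = rng.normal(50, 10, size=n)
ability = rng.normal(100, 15, size=n)
achievement = 60 + 0.05 * self_efficiency + 0.02 * ability + rng.normal(0, 5, size=n)

X = np.column_stack([np.ones(n), self_efficiency, ability])
beta, se, t_stats = ols_summary(X, achievement)
for name, b, s, t in zip(["intercept", "self_efficiency", "ability"], beta, se, t_stats):
    print(f"{name:>16}: coef={b:.3f}  se={s:.3f}  t={t:.2f}")
```

Each coefficient is read as the expected change in achievement for a one-unit change in that predictor, holding the other predictor fixed.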

Step 2: Weighted least squares regression

Table 2: Weighted least squares regression of generalized least squares case study


After performing the weighted analysis, self-efficiency was found to influence achievement more, with a beta coefficient of 0.045 and a p value of 0.021. This shows that the regression coefficient is statistically significant. Ability influenced achievement less, with a beta coefficient of 0.014 and a p value of 0.046. Both coefficients are statistically significant at the 0.05 level, which suggests that the weighted model describes these data better than the simple regression performed previously. Therefore, there is a significant relationship between the dependent variable ‘achievement’ and the independent variables ‘self-efficiency’ and ‘ability’.
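
Weighted least squares is the special case of GLS in which the error covariance matrix is diagonal. The sketch below uses hypothetical data whose error standard deviation grows with one predictor and assumes the weights (inverse error variances) are known. The article does not state how the weights behind Table 2 were chosen, so this is an illustration of the technique rather than a reproduction of that table.

```python
import numpy as np

def wls(X, y, weights):
    """WLS: minimise sum(w_i * residual_i**2) by rescaling each row with sqrt(w_i)."""
    sqrt_w = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return beta

rng = np.random.default_rng(2)
n = 30
x1 = rng.uniform(1, 10, size=n)
x2 = rng.uniform(1, 10, size=n)
sd = 0.5 * x1                                  # assumed: noise grows with x1
y = 2.0 + 0.4 * x1 + 0.1 * x2 + rng.normal(0, sd)

X = np.column_stack([np.ones(n), x1, x2])
weights = 1.0 / sd**2                          # inverse of the assumed error variances
beta_wls = wls(X, y, weights)

# Same estimate from the GLS formula with a diagonal covariance matrix
omega_inv = np.diag(weights)
beta_gls = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)
print(beta_wls)
print(np.allclose(beta_wls, beta_gls))
```

Observations with larger error variance receive smaller weights, so they pull less on the fitted coefficients.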

Applications of generalized least squares

  • The GLS model is useful in the regionalization of hydrologic data.
  • GLS is also useful in accounting for autocorrelation by choosing an appropriate weighting matrix.
  • It is one of the best methods to estimate regression models with autocorrelated disturbances and to test for serial correlation (here, serial correlation and autocorrelation mean the same thing).
  • One can also use the maximum likelihood technique to estimate regression models with autocorrelated disturbances.
  • The GLS procedure finds extensive use across various domains. For example, the GLS method is used to estimate the parameters of regional regression models of flood quantiles.
  • GLS is widely used in market response modelling, econometrics and time series analysis.
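
The role of the weighting matrix under autocorrelation can be illustrated with first-order autoregressive, AR(1), disturbances. The sketch below is a minimal example on simulated data and assumes the autocorrelation parameter `rho` is known; in applied work it has to be estimated. It shows that GLS with the AR(1) correlation matrix gives the same coefficients as OLS on data transformed with the Prais-Winsten transformation.

```python
import numpy as np

def ar1_correlation(n, rho):
    """Correlation matrix of AR(1) errors: entry (i, j) is rho**|i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def prais_winsten_transform(a, rho):
    """Quasi-difference rows (a_t - rho * a_{t-1}); scale the first row by sqrt(1 - rho**2)."""
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    out[0] = np.sqrt(1.0 - rho**2) * a[0]
    out[1:] = a[1:] - rho * a[:-1]
    return out

rng = np.random.default_rng(3)
n, rho = 50, 0.7                               # rho is assumed known in this sketch
x = rng.normal(size=n)

# Simulate stationary AR(1) disturbances
e = np.empty(n)
e[0] = rng.normal() / np.sqrt(1.0 - rho**2)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()

y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(n), x])

omega = ar1_correlation(n, rho)
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(omega, X),
                           X.T @ np.linalg.solve(omega, y))
beta_pw, *_ = np.linalg.lstsq(prais_winsten_transform(X, rho),
                              prais_winsten_transform(y, rho), rcond=None)
print(beta_gls)
print(np.allclose(beta_gls, beta_pw))
```

The constant factor that scales the AR(1) covariance matrix cancels in the GLS formula, which is why the correlation matrix alone is enough to obtain the coefficients.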

A number of software packages support generalized least squares, such as R, MATLAB, SAS, SPSS and Stata.

Priya Chetty

Partner at Project Guru
