How to apply missing data imputation?

By Prateek Sharma & Priya Chetty on March 9, 2018

Missing data is one of the most common problems in statistical analysis. When values are not available for all observations of the variables in a model, the dataset is said to have ‘missing data’. The problem appears in almost all research and is especially common in scientific domains such as biology and medicine. If missing values are not handled properly, they complicate the handling and analysis of the data, reduce the efficiency of the analysis and can produce biased results. Missing data imputation, in which the missing values are replaced with substituted values, is therefore important. Missing data can arise for various reasons, such as mishandling of samples or deletion of values that were considered unwanted.

Types of missing data values in a dataset

There are three types of missing data values in a dataset:

  • Missing completely at random (MCAR): These missing values occur purely at random. The probability that a value is missing is unrelated both to the values that are observed and to the values that are missing.
  • Missing at random (MAR): The probability that a value is missing depends on other, observed variables but not on the missing value itself. Unintentional omissions during data collection often fall into this category, and they mostly pertain to a particular variable only.
  • Missing not at random (MNAR): These missing values do not occur randomly. The missingness depends on the unobserved value itself or on unknown variables, and it often arises from the intentional omission of values for a particular feature at the time of data collection.
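To make the three mechanisms concrete, the following minimal Python sketch deletes values from a made-up ‘income’ column under each mechanism, following the standard statistical definitions. The variable names, probabilities and thresholds are illustrative assumptions and are not part of the case study in this article.

```python
import random

def make_missing(rows, mechanism, rng):
    """Return a copy of rows with some 'income' values set to None.

    rows is a list of dicts with the hypothetical keys 'age' and 'income'.
    """
    out = []
    for row in rows:
        if mechanism == "MCAR":
            # Every row has the same chance of losing its value.
            p = 0.3
        elif mechanism == "MAR":
            # The chance depends only on the observed variable 'age'.
            p = 0.5 if row["age"] >= 40 else 0.1
        elif mechanism == "MNAR":
            # The chance depends on the very value that goes missing.
            p = 0.5 if row["income"] >= 50 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        income = None if rng.random() < p else row["income"]
        out.append({"age": row["age"], "income": income})
    return out

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [{"age": rng.randint(20, 60), "income": rng.randint(20, 80)}
        for _ in range(1000)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    data = make_missing(rows, mechanism, rng)
    n_missing = sum(r["income"] is None for r in data)
    print(mechanism, "missing values:", n_missing)
```

In real data the mechanism is not known in advance, which is why the pattern of missing values is inspected before an imputation method is chosen.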

Methods for handling missing data imputation

There are many ways to handle missing data. The most common methods are as follows:

Mean or mode imputation

Mean imputation is a method in which the mean of the available cases replaces the missing value of a variable. For categorical variables, the mode (the most frequent value) is used in the same way. This method maintains the sample size and is easy to apply, but it reduces the variability in the data, so the standard deviation and variance estimates tend to be too low. For example, in the data set (1, 3, 4, 7, x, 10) one value is missing, denoted by ‘x’. Under mean imputation the missing value is replaced by the mean of all the other numbers in the data set, so x = (1+3+4+7+10)/5 = 5.
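The worked example can be reproduced with a short Python sketch. It uses only the standard library, marks the missing value with None, and the function name mean_impute is a hypothetical helper introduced for illustration.

```python
from statistics import fmean, variance

def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed)
    return [fill if v is None else v for v in values]

data = [1, 3, 4, 7, None, 10]
observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print(imputed)  # [1, 3, 4, 7, 5.0, 10]
# Sample variance before and after imputation
print(variance(observed), variance(imputed))  # 12.5 10.0
```

The second line of output shows the drawback: the sample size grows from 5 to 6 while the spread around the mean stays the same, so the variance estimate drops.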

Regression imputation

In regression imputation the imputed value is predicted from a regression equation. A regression model is first fitted on the complete cases to predict the variable that has missing data from the other variables. The fitted model is then used to impute the missing values.
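A minimal sketch of this idea in Python, assuming a single predictor x and simple least-squares regression. The data and the helper names fit_line and regression_impute are made up for illustration.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_impute(xs, ys):
    """Fill each missing y (None) with the prediction from the fitted line."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, None, 8.0, 10.0]  # third value is missing
print(regression_impute(x, y))
```

Because every imputed value falls exactly on the regression line, this approach understates the natural scatter of the variable.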

Stochastic regression

Stochastic regression imputation aims to reduce the bias of regression imputation, which understates the variability of the imputed variable. It does so with an extra step in which each predicted score is augmented with a residual term. The method is therefore similar to regression imputation, except that a random error term is added to each prediction.
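The extra step can be sketched in Python as follows. The sketch assumes one predictor and normally distributed residuals whose spread is estimated from the complete cases; the data and function name are hypothetical.

```python
import random
from statistics import fmean

def stochastic_regression_impute(xs, ys, rng):
    """Impute missing y values as regression prediction plus random noise."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    px = [x for x, _ in pairs]
    py = [y for _, y in pairs]

    # Fit y = intercept + slope * x on the complete cases.
    mx, my = fmean(px), fmean(py)
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a in px))
    intercept = my - slope * mx

    # Residual standard deviation with n - 2 degrees of freedom
    # (needs at least three complete cases).
    residuals = [b - (intercept + slope * a) for a, b in pairs]
    sigma = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5

    # Prediction plus a random draw from the assumed residual distribution.
    return [intercept + slope * a + rng.gauss(0, sigma) if b is None else b
            for a, b in zip(xs, ys)]

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, 9.8, None]
print(stochastic_regression_impute(x, y, random.Random(42)))
```

A different seed gives different imputed values, which is the intended behaviour: the noise restores some of the variability that plain regression imputation removes.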

Hot Deck Imputation

Hot-deck imputation matches each non-respondent to a similar respondent and imputes the missing value with the score of that respondent. To run hot-deck imputation, the first step is to sort the data by the variables used for matching. After sorting, each missing value is replaced with a copy of the value immediately prior to it.
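The sort-and-copy procedure can be sketched in Python as below. The records, the matching variable ‘age’ and the function name are illustrative assumptions.

```python
def sequential_hot_deck(records, sort_key, target):
    """Sort by sort_key, then fill each missing target with the prior value."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    filled = []
    last_seen = None  # most recent observed value (the donor)
    for rec in ordered:
        rec = dict(rec)  # copy so the input is not modified
        if rec[target] is None:
            # Stays None if no observed value precedes it.
            rec[target] = last_seen
        else:
            last_seen = rec[target]
        filled.append(rec)
    return filled

people = [
    {"age": 34, "income": 52},
    {"age": 25, "income": 30},
    {"age": 36, "income": None},
    {"age": 27, "income": None},
    {"age": 50, "income": 75},
]
for row in sequential_hot_deck(people, "age", "income"):
    print(row)
```

After sorting by age, the 27-year-old receives the income of the 25-year-old and the 36-year-old receives the income of the 34-year-old.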

Cold deck

In cold-deck imputation a value is selected systematically from an individual who has similar values on other variables. In cold deck the responses of the same respondent on other variables are relevant, whereas in hot deck the responses of other respondents on the same variable are relevant.

Interpolation and extrapolation

Here the missing value is estimated from other observations of the same individual. This is similar to cold-deck imputation; the only difference is that in cold deck a specific value is selected, whereas in this case the value is estimated.
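As a minimal sketch, linear interpolation over repeated measurements of one individual could look like this in Python. The measurements are made up, equal spacing between observations is assumed, and gaps at the start or end are left untouched because filling them would require extrapolation.

```python
def interpolate_linear(values):
    """Fill None gaps that lie between two observed values."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of neighbouring observed positions.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

# Hypothetical weights of one person at six equally spaced visits
weights = [60.0, None, None, 66.0, None, 70.0]
print(interpolate_linear(weights))  # [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
```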

Multiple imputation method

Multiple imputation overcomes the problem of noise in single imputation. Unlike single imputation, in the multiple imputation method the imputed values are drawn m times rather than just once. This produces m completed datasets; each one is analyzed separately and the results are then pooled.
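The idea of drawing m times and pooling can be illustrated with a deliberately simplified Python sketch. It assumes the missing values can be drawn from a normal distribution fitted to the observed values, estimates the mean on each completed dataset and combines the m results with Rubin's rules. A full implementation would also draw the distribution parameters themselves, and this sketch is not the procedure SPSS uses. The data and function names are hypothetical.

```python
import random
from statistics import fmean, variance

def impute_once(values, rng):
    """One completed dataset: each None is replaced by a random draw."""
    observed = [v for v in values if v is not None]
    mu, sd = fmean(observed), variance(observed) ** 0.5
    return [rng.gauss(mu, sd) if v is None else v for v in values]

def pooled_mean(values, m, rng):
    """Estimate the mean from m imputations (m >= 2) using Rubin's rules."""
    estimates, within = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(fmean(completed))
        # Squared standard error of the mean in this completed dataset
        within.append(variance(completed) / len(completed))
    q_bar = fmean(estimates)      # pooled estimate
    w = fmean(within)             # within-imputation variance
    b = variance(estimates)       # between-imputation variance
    total = w + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, total

data = [79.0, None, 64.0, 88.0, None, 71.0, 93.0, 55.0]
estimate, total_variance = pooled_mean(data, m=5, rng=random.Random(0))
print(estimate, total_variance)
```

The between-imputation term is what a single imputation cannot provide: it reflects how much the answer changes from one set of imputed values to the next.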

SPSS example of missing data imputation

In many research studies the problem of missing data arises due to circumstances beyond the control of the researchers. This section presents a case study to explain the process of analysis using missing data imputation.

This case study uses a dataset of the relative consumption of food items in European and Scandinavian countries. After the data are entered in SPSS, the multiple imputation method generates values for the missing observations. The process for generating results is as follows.

Analyze> Multiple Imputation> Analyze Pattern

First, it is important to check the patterns of missing observations in the dataset in order to apply the appropriate imputation method. Two different results appear here. The first is the variable summary, which shows the percentage of missing observations for each variable. For instance, as the table below shows, 27.1% of the data is missing for Coffee, 22.9% for Tea and so on.

Variable Summary

Variable      Missing N   Missing Percent   Valid N   Mean      Std. Deviation
Coffee        13          27.1%             35        79.0857   20.00349
Tea           11          22.9%             37        77.9459   22.45359
Bread         11          22.9%             37        76.9459   21.49282
Butter        10          20.8%             38        40.4737   24.26847
Vegetables    10          20.8%             38        38.4211   23.95788
Fruits        10          20.8%             38        77.4474   22.34822
Oil           8           16.7%             40        38.5750   23.10055
Snacks        7           14.6%             41        36.9512   23.60715

Table 1: Variable Summary in SPSS
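Outside SPSS, the missing counts and percentages of such a summary can be computed in a few lines. The Python sketch below is an illustration only: the columns contain placeholder values (not the real consumption figures) arranged so that the number of missing and valid cases matches the Coffee and Snacks rows of Table 1, each with 48 cases.

```python
def missing_summary(columns):
    """Return {variable: (missing count, missing percent)} per column."""
    summary = {}
    for name, values in columns.items():
        n_missing = sum(v is None for v in values)
        percent = round(100 * n_missing / len(values), 1)
        summary[name] = (n_missing, percent)
    return summary

# Placeholder data: 1.0 stands in for any observed value, None is missing.
columns = {
    "Coffee": [1.0] * 35 + [None] * 13,
    "Snacks": [1.0] * 41 + [None] * 7,
}
summary = missing_summary(columns)
print(summary)  # {'Coffee': (13, 27.1), 'Snacks': (7, 14.6)}
```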

Furthermore, the missing value pattern helps to analyze whether the data are missing at random or systematically. As the figure below shows, there is no particular trend among the variables with regard to their missing data. Therefore, this is a case of random missing data.
Figure 1: Missing Data Pattern in SPSS

Analyze>Multiple Imputation>Impute Missing Data Values

After verifying that the data are missing at random, apply missing data imputation. The image below shows the results for the missing values.

Figure 2: Results for Missing Data Imputation

As the results show, the numbers highlighted in yellow are the new values produced by the multiple imputation process.

Many software packages support missing data imputation with multiple independent variables, including R, SAS, MATLAB, STATA and SPSS.



