How to apply missing data imputation?

By Prateek Sharma & Priya Chetty on March 9, 2018

Missing data is one of the most common problems in statistical analysis. Data are ‘missing’ when values are not available for every observation of every variable in the model. The problem occurs in almost all research and is especially common in scientific domains such as biology and medicine. If missing values are not treated properly, they complicate data handling and analysis, reduce the efficiency of the analysis and can produce biased results. Missing data imputation, in which the missing values are replaced with substituted values, is therefore important. Missing data can arise for various reasons, such as mishandling of samples or deletion of unwanted values.

Types of missing data values in a dataset

There are three types of missing data values in a dataset:

Missing completely at random (MCAR): The values are missing purely by chance. The probability that a value is missing is unrelated both to the values that are available to observe and to the missing values themselves. Unintentional omission errors during data collection are a typical cause.
Missing at random (MAR): The probability that a value is missing depends on other observed variables, but not on the missing value itself. This type of missingness mostly pertains to a particular variable and can be explained by the data that are available.
Missing not at random (MNAR): The probability that a value is missing depends on the unobserved value itself or on unknown variables. It mostly occurs through intentional omission of values in a particular feature at the time of data collection. In other words, this type of missing value does not occur randomly.
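
The difference between the three types can be made concrete with a small simulation. The sketch below assumes the standard definitions (missingness independent of everything, dependent on an observed variable, or dependent on the missing value itself). The variables `ages` and `incomes`, the 20% and 50% rates and the cutoff are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: income rises with age, plus random noise.
ages = [rng.randint(20, 70) for _ in range(1000)]
incomes = [1000 + 50 * age + rng.gauss(0, 300) for age in ages]

# MCAR: every income has the same 20% chance of being missing.
mcar = [None if rng.random() < 0.2 else inc for inc in incomes]

# MAR: missingness depends only on the observed variable (age).
mar = [None if age > 50 and rng.random() < 0.5 else inc
       for age, inc in zip(ages, incomes)]

# MNAR: missingness depends on the income value itself
# (here the highest incomes are withheld).
cutoff = sorted(incomes)[int(0.8 * len(incomes))]
mnar = [None if inc > cutoff else inc for inc in incomes]


def observed_mean(values):
    """Mean of the values that are not missing."""
    return mean(v for v in values if v is not None)


print("true mean:", round(mean(incomes)))
print("MCAR mean:", round(observed_mean(mcar)))
print("MAR mean: ", round(observed_mean(mar)))
print("MNAR mean:", round(observed_mean(mnar)))
```

In this sketch the MNAR mean is always below the true mean, because the largest values are exactly the ones that are removed.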

Methods for handling missing data imputation

There are many ways to handle missing data. The most common methods are as follows:

Mean or mode imputation

Mean imputation is a method in which the mean of the available cases replaces the missing value of a variable; for categorical variables, the mode (the most frequent value) is used instead. This method maintains the sample size and is easy to apply, but it reduces the variability in the data, so the standard deviation and variance estimates tend to be too low. For example, in the data set (1, 3, 4, 7, x, 10), one value is missing, denoted by ‘x’. Under mean imputation, the missing value is replaced by the mean of all the other numbers in the data set, so x = (1+3+4+7+10)/5 = 5.
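
The worked example can be reproduced in a few lines of Python. This is a minimal sketch that uses only the standard library; the function name `mean_impute` and the use of `None` as the missing-value marker are assumptions made for illustration.

```python
from statistics import mean, variance


def mean_impute(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


data = [1, 3, 4, 7, None, 10]  # None marks the missing value x
completed = mean_impute(data)
print(completed)

# The imputed value sits exactly at the mean, so the sample variance shrinks.
observed = [v for v in data if v is not None]
print(variance(observed), variance(completed))
```

The second print shows why the variance estimate ends up too low: the sum of squared deviations is unchanged, but it is now divided by a larger sample size.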

Regression imputation

In regression imputation, the imputed value is predicted from a regression equation. First, a regression model is fitted on the complete cases to predict the variable that has missing data from the other variables. The fitted model is then used to predict the missing values, and these predictions are filled in.
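
A minimal sketch of this procedure, assuming a single predictor and ordinary least squares, is given below. The helper names `fit_line` and `regression_impute` and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def regression_impute(xs, ys):
    """Fit on the complete cases, then predict y wherever it is None."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]


x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, None, 8.0, 10.0]  # the third value is missing
filled = regression_impute(x, y)
print(filled)
```

Because every imputed value lies exactly on the fitted line, this approach overstates the strength of the relationship between the variables.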

Stochastic regression

Stochastic regression imputation aims to reduce the bias of regression imputation through an extra step: each predicted value is augmented with a randomly drawn residual term. It is otherwise the same as regression imputation, but the added error term restores the variability that simple regression imputation leaves out.
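
The extra step can be sketched as follows, assuming a single predictor and normally distributed residuals. The function name, the seed handling and the sample values are assumptions made for illustration.

```python
import random


def stochastic_regression_impute(xs, ys, seed=0):
    """Regression imputation plus a random residual on each prediction."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation, with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    resid_sd = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

    # Prediction plus a residual drawn from N(0, resid_sd).
    return [intercept + slope * x + rng.gauss(0, resid_sd) if y is None else y
            for x, y in zip(xs, ys)]


x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, 9.8, 12.1]
print(stochastic_regression_impute(x, y, seed=1))
```

A different seed gives a different imputed value, which is the intended behaviour: the imputation no longer falls exactly on the regression line.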

Hot Deck Imputation

Hot deck imputation matches non-respondents to similar respondents and replaces the missing value with the score of that similar respondent. In its simplest (sequential) form, the first step is to sort the data by variables related to the one being imputed. After sorting, each missing value is replaced by the value of the record immediately before it.
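
The sort-then-copy procedure can be sketched as below. This assumes the sequential form of hot deck imputation with one sorting variable; the field names `age` and `income` and the records are hypothetical.

```python
def sequential_hot_deck(records, sort_key, target):
    """Sort by sort_key, then fill a missing target from the previous record."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    completed = []
    last_seen = None  # most recent observed value of the target variable
    for rec in ordered:
        rec = dict(rec)  # copy, so the input records are left untouched
        if rec[target] is None:
            rec[target] = last_seen  # stays None if no donor precedes it
        else:
            last_seen = rec[target]
        completed.append(rec)
    return completed


survey = [
    {"age": 41, "income": 52},
    {"age": 23, "income": 30},
    {"age": 25, "income": None},
    {"age": 40, "income": 50},
    {"age": 43, "income": None},
]
result = sequential_hot_deck(survey, "age", "income")
for row in result:
    print(row)
```

Sorting by age places each non-respondent next to a respondent of similar age, so the copied value comes from a resembling case.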

Cold deck

In cold deck imputation, the replacement value is selected systematically from an individual who has similar values on other variables. It is similar to hot deck imputation, but the donor is chosen by a fixed rule, often from an external source such as an earlier survey, rather than at random from the current dataset.

Interpolation and extrapolation

Here the missing value is estimated from other observations of the same individual, for example from earlier and later measurements of the same variable. This is similar to cold deck imputation; the only difference is that in cold deck an existing value is selected, whereas in this case the value is estimated.
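
For repeated measurements on one individual, linear interpolation is one simple way to do this. The sketch below assumes equally spaced measurements and fills only interior gaps; the function name and the sample series are hypothetical.

```python
def interpolate(series):
    """Fill interior None values on a straight line between observed neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None


# Five measurements of the same person; the second and third are missing.
weights = [60.0, None, None, 66.0, 67.0]
print(interpolate(weights))
```

Gaps at the start or end of the series would require extrapolation, which this sketch deliberately does not attempt.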

Multiple imputation method

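Multiple imputation addresses a weakness of single imputation, which treats the imputed values as if they had been observed and so understates the uncertainty. In the multiple imputation method, the imputed values are drawn m times rather than just once, which produces m completed datasets. Each dataset is analysed separately and the m results are pooled into a single estimate.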
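
The idea of drawing m times and pooling can be sketched as below. This is a deliberately simplified illustration, not what SPSS does: it assumes the variable is roughly normal, draws imputations from a normal distribution with the observed mean and standard deviation, and pools the m estimates of the mean with Rubin's rules. The function name and the data are hypothetical.

```python
import random
from statistics import mean, variance


def multiple_impute_mean(values, m=5, seed=0):
    """Estimate the mean from m imputed datasets and pool the results."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sd = variance(observed) ** 0.5

    estimates = []  # mean of each completed dataset
    within = []     # squared standard error of each mean
    for _ in range(m):
        completed = [rng.gauss(mu, sd) if v is None else v for v in values]
        estimates.append(mean(completed))
        within.append(variance(completed) / len(completed))

    pooled = mean(estimates)        # pooled point estimate
    w = mean(within)                # within-imputation variance
    b = variance(estimates)         # between-imputation variance
    total_var = w + (1 + 1 / m) * b # Rubin's rules
    return pooled, total_var


data = [1.0, 3.0, 4.0, 7.0, None, 10.0]
pooled, total_var = multiple_impute_mean(data, m=5, seed=0)
print(pooled, total_var ** 0.5)
```

The between-imputation variance is what single imputation leaves out: it measures how much the estimate changes from one plausible imputation to the next.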

SPSS example of missing data imputation

In many research studies, missing data arise due to circumstances beyond the control of the researchers. This section presents a case study to explain the process of analysis using missing data imputation.

The case study uses a dataset of the relative consumption of food items in European and Scandinavian countries. After the data are entered in SPSS, the multiple imputation method is used to fill in the missing values. The process for generating the results is as follows.

Analyze > Multiple Imputation > Analyze Patterns

First, it is important to check the patterns of missing observations in the dataset in order to apply the appropriate imputation method. Two different results appear. The first is the variable summary, which shows the percentage of missing observations for each variable. For instance, as the table below shows, 27.1% of the data is missing for Coffee, 22.9% for Tea and so on.

Variable Summary
Variable	Missing N	Missing Percent	Valid N	Mean	Std. Deviation
Coffee	13	27.1%	35	79.0857	20.00349
Tea	11	22.9%	37	77.9459	22.45359
Bread	11	22.9%	37	76.9459	21.49282
Butter	10	20.8%	38	40.4737	24.26847
Vegetables	10	20.8%	38	38.4211	23.95788
Fruits	10	20.8%	38	77.4474	22.34822
Oil	8	16.7%	40	38.5750	23.10055
Snacks	7	14.6%	41	36.9512	23.60715

Table 1: Variable Summary in SPSS
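
The percentages in the variable summary are simply the missing count divided by the total number of cases. In every row the missing and valid counts add up to 48, so the figures can be checked with a short sketch (the dictionary below copies the Missing N column; the variable names in the code are otherwise arbitrary).

```python
TOTAL_CASES = 48  # Missing N + Valid N in every row of the table

missing_n = {
    "Coffee": 13, "Tea": 11, "Bread": 11, "Butter": 10,
    "Vegetables": 10, "Fruits": 10, "Oil": 8, "Snacks": 7,
}

# Percent missing and valid count for each variable.
missing_percent = {name: round(100 * n / TOTAL_CASES, 1)
                   for name, n in missing_n.items()}
valid_n = {name: TOTAL_CASES - n for name, n in missing_n.items()}

for name in missing_n:
    print(f"{name:<11}{missing_n[name]:>3}{missing_percent[name]:>7}%{valid_n[name]:>4}")
```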

Furthermore, the missing value patterns output helps determine whether the data are missing at random or systematically. As the figure below shows, there is no particular trend among the variables with regard to their missing data. Therefore, this is a case of randomly missing data.

Analyze > Multiple Imputation > Impute Missing Data Values

After confirming that the missing data pattern is random, apply missing data imputation. The image below shows the results after imputation of the missing values.

Figure 2: Results for Missing Data Imputation

As the results show, the numbers highlighted in yellow are the new values produced by the multiple imputation process.

Many software packages support missing data imputation with multiple independent variables, such as R, SAS, MATLAB, STATA and SPSS.