How to apply missing data imputation?

Missing data is one of the most common problems in statistical analysis. A dataset has ‘missing data’ when values are not available for every observation of the variables in the model. The problem occurs in almost all research and is especially common in scientific domains such as biology and medicine. It can arise for various reasons, such as mishandling of samples or deletion of unwanted values. If missing values are not treated properly, they complicate the handling and analysis of the data, reduce the efficiency of the analysis and can produce biased results. Missing data imputation, in which the missing values are replaced with substituted values, is therefore important.

Types of missing data values in a dataset

There are three types of missing data values in a dataset:

  • Missing completely at random (MCAR): The values are missing purely at random. The chance that a value is missing has no relation to any other value in the dataset, whether observed or unobserved, for example when a sample is lost by accident.
  • Missing at random (MAR): The chance that a value is missing depends only on the observed values of other variables, not on the missing value itself. Such gaps are typically unintentional omissions during data collection and mostly pertain to a particular variable.
  • Missing not at random (MNAR): The chance that a value is missing depends on the missing value itself or on unobserved variables. This mostly happens when values of a particular feature are intentionally omitted at the time of data collection. In other words, the missingness is not random.
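
The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed conditions: the variables `age` and `income`, the sample size and the missingness probabilities are all hypothetical and chosen for illustration.

```python
import random

rng = random.Random(42)
n = 2000
# Hypothetical, independently generated variables
age = [rng.uniform(20, 70) for _ in range(n)]
income = [rng.uniform(20, 100) for _ in range(n)]

def mask(values, prob_missing):
    """Set each value to None with its own probability of being missing."""
    return [None if rng.random() < p else v for v, p in zip(values, prob_missing)]

# MCAR: every value has the same chance of being missing
mcar = mask(income, [0.3] * n)
# MAR: the chance depends on another, fully observed variable (age)
mar = mask(income, [0.6 if a > 45 else 0.1 for a in age])
# MNAR: the chance depends on the value that goes missing (income itself)
mnar = mask(income, [0.7 if v > 60 else 0.1 for v in income])

def observed_mean(values):
    """Mean of the values that are still observed."""
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

print("full mean:", round(sum(income) / n, 1))
print("MCAR mean:", round(observed_mean(mcar), 1))
print("MNAR mean:", round(observed_mean(mnar), 1))
```

In this toy setup the mean of the observed values stays close to the full mean under MCAR, while under MNAR it is pulled downwards because high values are the ones that tend to go missing.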

Methods for handling missing data imputation

There are many ways to handle missing data. The most common methods are as follows:

Mean or mode imputation

Mean imputation replaces each missing value of a variable with the mean of the available cases; for categorical variables the mode is used instead. This method maintains the sample size and is easy to apply, but it reduces the variability in the data, so the standard deviation and variance estimates tend to be too low. For example, suppose one value, denoted by ‘x’, is missing in the data set (1, 3, 4, 7, x, 10). Under mean imputation the missing value is the mean of all the other numbers in the data set, so x = (1+3+4+7+10)/5 = 5.
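
The same example can be reproduced in a few lines of Python. This is a minimal sketch that assumes missing values are stored as `None`; the function name `mean_impute` is chosen here for illustration.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

data = [1, 3, 4, 7, None, 10]      # None marks the missing value x
imputed = mean_impute(data)        # x is replaced by (1+3+4+7+10)/5

observed = [v for v in data if v is not None]
print(imputed)
print(variance(observed), variance(imputed))  # sample variance before and after
```

The sample variance falls from 12.5 for the five observed values to 10 after imputation, which illustrates why variance estimates come out too low.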

Regression imputation

In regression imputation the imputed value is predicted from a regression equation. A regression model is first fitted on the complete cases to predict the variable that has missing data from the other variables. The fitted model is then used to impute the missing values.
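
As a rough sketch of the idea, assume a single predictor `x`, a target `y` with missing entries stored as `None`, and an ordinary least squares line fitted on the complete cases. The data and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def regression_impute(xs, ys):
    """Fill None entries of ys with predictions from the complete cases."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, None, 8.0, 10.0]
filled = regression_impute(x, y)
print(filled)  # the missing entry is replaced by the fitted value at x = 3
```

Because the imputed value lies exactly on the fitted line, this approach overstates how strongly the variables are related.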

Stochastic regression

Stochastic regression imputation aims to reduce the bias through an extra step that augments each predicted score with a random residual term. It is otherwise the same as regression imputation; the added error term restores the variability that simple regression imputation leaves out.
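
A minimal sketch of this extra step is given below. It assumes one predictor, normally distributed residuals and a fixed random seed for reproducibility; the data and the function name are hypothetical.

```python
import random

def stochastic_regression_impute(xs, ys, seed=0):
    """Regression imputation plus a random residual drawn for each imputed value."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    # Residual standard deviation of the fitted line (n - 2 degrees of freedom)
    sse = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    resid_sd = (sse / (n - 2)) ** 0.5
    # Prediction plus noise for missing entries, observed values kept as they are
    return [intercept + slope * x + rng.gauss(0, resid_sd) if y is None else y
            for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, 9.8, None]
filled = stochastic_regression_impute(x, y, seed=1)
print(filled)
```

The imputed values now scatter around the regression line instead of lying exactly on it, and a different seed gives different imputations.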

Hot Deck Imputation

Hot-deck imputation matches each non-respondent to a similar respondent and imputes the missing value with the score of that respondent. To run the simplest form of hot deck imputation, the first step is to sort the data on the related variables. After sorting, each missing value is replaced with a copy of the value immediately before it.
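
The sort-and-copy procedure can be sketched as follows. The records, the sorting variable `age` and the target variable `income` are hypothetical, and a missing value is assumed to be stored as `None`.

```python
def sequential_hot_deck(records, sort_key, target):
    """Sort by an auxiliary variable, then copy the previous observed
    value of `target` into each missing entry."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    last_seen = None
    result = []
    for rec in ordered:
        rec = dict(rec)                # copy so the input is not modified
        if rec[target] is None:
            rec[target] = last_seen    # stays None if no donor precedes it
        else:
            last_seen = rec[target]
        result.append(rec)
    return result

survey = [
    {"age": 41, "income": 52},
    {"age": 23, "income": 30},
    {"age": 25, "income": None},
    {"age": 39, "income": None},
    {"age": 38, "income": 50},
]
filled = sequential_hot_deck(survey, sort_key="age", target="income")
for row in filled:
    print(row)
```

After sorting by age, the 25-year-old receives the income of the 23-year-old and the 39-year-old receives the income of the 38-year-old.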

Cold deck

In cold deck imputation the replacement value is selected systematically from an individual who has similar values on the other variables. The difference from hot deck is the source of the donor: hot deck draws the donor from other respondents in the current dataset, whereas cold deck typically takes the value from an external source, such as an earlier survey.

Interpolation and extrapolation

Here the missing value is estimated from other observations of the same individual, so the method mostly suits longitudinal or otherwise ordered data. It is similar to cold deck imputation; the only difference is that cold deck selects a specific existing value, whereas interpolation and extrapolation estimate a new one.
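
For repeated measurements on one individual, linear interpolation can be sketched as below. The weekly weights are hypothetical, and gaps before the first or after the last observation (which would require extrapolation) are deliberately left unfilled.

```python
def linear_interpolate(times, values):
    """Fill None entries that lie between two observed points on a straight line."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (times[i] - times[left]) / (times[right] - times[left])
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

weeks = [0, 1, 2, 3, 4]
weight = [70.0, None, 72.0, None, 75.0]   # one individual, two missing visits
print(linear_interpolate(weeks, weight))
```

Each gap is filled with the value halfway between its two neighbours, because the missing visits fall exactly midway in time.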

Multiple imputation method

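Multiple imputation overcomes a weakness of the single imputation methods, which treat each imputed value as if it were observed and therefore understate the uncertainty. Unlike single imputation, in the multiple imputation method the imputed values are drawn m times rather than just once, giving m completed datasets. Each dataset is analysed separately and the m results are then pooled.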
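
The sketch below shows the overall workflow under simplifying assumptions: each of the m imputations is a stochastic regression imputation with one predictor, the quantity of interest is the mean of `y`, and the results are pooled with Rubin's rules. A full implementation would also redraw the regression parameters for every imputation, which is omitted here. All data and names are hypothetical.

```python
import random
from statistics import mean, variance

def impute_once(xs, ys, rng):
    """One stochastic regression imputation of the None entries in ys."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    resid_sd = (sse / (n - 2)) ** 0.5
    return [intercept + slope * x + rng.gauss(0, resid_sd) if y is None else y
            for x, y in zip(xs, ys)]

def multiple_imputation_mean(xs, ys, m=5, seed=0):
    """Impute m times, estimate the mean of y each time, pool with Rubin's rules."""
    rng = random.Random(seed)
    completed = [impute_once(xs, ys, rng) for _ in range(m)]
    estimates = [mean(d) for d in completed]
    # Within-imputation variance: average sampling variance of the mean
    within = mean([variance(d) / len(d) for d in completed])
    # Between-imputation variance: spread of the m estimates
    between = variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return mean(estimates), total_variance, completed

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
pooled_mean, pooled_var, datasets = multiple_imputation_mean(x, y, m=5)
print(pooled_mean, pooled_var)
```

The pooled variance combines the ordinary sampling variance with the extra variance caused by the imputations differing from one another, which is the uncertainty a single imputation ignores.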

SPSS example of missing data imputation 

In many research studies the problem of missing data arises from circumstances beyond the control of the researchers. This section presents a case study to explain the process of analysis using missing data imputation.

This case study uses a dataset on the relative consumption of food items in European and Scandinavian countries. After the data are entered in SPSS, the multiple imputation method generates values for the missing observations. The process for generating results is as follows.

Analyze > Multiple Imputation > Analyze Pattern

First, it is important to check the patterns of missing observations in the dataset in order to apply the appropriate imputation method. Two different results appear here. The first is the variable summary, which shows the percentage of missing observations for each variable in the data set. For instance, as the table below shows, 27.1% of the data is missing for Coffee, 22.9% for Tea and so on.

Variable Summary

Variable      Missing N   Missing Percent   Valid N   Mean      Std. Deviation
Coffee        13          27.1%             35        79.0857   20.00349
Tea           11          22.9%             37        77.9459   22.45359
Bread         11          22.9%             37        76.9459   21.49282
Butter        10          20.8%             38        40.4737   24.26847
Vegetables    10          20.8%             38        38.4211   23.95788
Fruits        10          20.8%             38        77.4474   22.34822
Oil           8           16.7%             40        38.5750   23.10055
Snacks        7           14.6%             41        36.9512   23.60715

Table 1: Variable Summary in SPSS
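
Outside SPSS, a similar summary of missing counts and percentages can be produced with a short script. The sketch below uses a small hypothetical dataset, not the case study data, and assumes missing values are stored as `None`.

```python
def missing_summary(columns):
    """Per variable: (name, missing count, missing percent, valid count),
    ordered from most to least missing."""
    rows = []
    for name, values in columns.items():
        n_missing = sum(v is None for v in values)
        percent = round(100 * n_missing / len(values), 1)
        rows.append((name, n_missing, percent, len(values) - n_missing))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical toy data for three variables and four countries
toy = {
    "Coffee": [90, None, 49, None],
    "Tea": [88, 60, None, 99],
    "Bread": [70, 65, 80, 75],
}
summary = missing_summary(toy)
for name, n_missing, percent, n_valid in summary:
    print(f"{name:<8} missing={n_missing} ({percent}%) valid={n_valid}")
```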

The second result, the missing value pattern, helps analyze whether the data are missing at random or systematically. As the figure below shows, there is no particular trend among the variables with regard to their missing data. Therefore, this is a case of random missing data.

Figure 1: Missing Data Pattern in SPSS


Analyze > Multiple Imputation > Impute Missing Data Values

After confirming that the data are missing at random, apply missing data imputation. The image below shows the resulting imputed values.

Figure 2: Results for Missing Data Imputation


As the results show, the numbers highlighted in yellow are the new values produced by the multiple imputation process.

Software supporting missing data imputation

Many software packages support missing data imputation with multiple independent variables, including R, SAS, MATLAB, STATA and SPSS.

Prateek Sharma


Analyst at Project Guru