How to perform bootstrap and jackknife analysis?

Bootstrap and jackknife are superficially similar statistical techniques that both involve resampling the data. They are nonparametric resampling techniques that can estimate the standard error and confidence interval of an estimate of a population parameter, such as a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient. Bradley Efron introduced the bootstrap method in 1979 for evaluating the variance of an estimator. Quenouille had introduced the jackknife method earlier, in 1949, to estimate the bias of an estimator; it is also used to evaluate an estimator's variance. Both methods are also helpful in calculating an appropriate sample size for experimental design.
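As a minimal sketch of the jackknife idea described above, the Python snippet below recomputes a statistic with each observation left out in turn, then combines the leave-one-out values into bias and standard error estimates. The function names `jackknife` and `mean` are illustrative assumptions, and the five values are the ones used in the example section of this article.

```python
import math

def jackknife(data, statistic):
    """Leave-one-out jackknife: returns (estimate, bias, standard_error)."""
    n = len(data)
    full = statistic(data)
    # Recompute the statistic n times, dropping one observation each time.
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    # Jackknife bias estimate: (n - 1) * (mean of leave-one-out values - full estimate).
    bias = (n - 1) * (loo_mean - full)
    # Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations).
    se = math.sqrt((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo))
    return full, bias, se

def mean(values):
    return sum(values) / len(values)

data = [5, 4, 8, 9, 7]  # the five data points from the example section
estimate, bias, se = jackknife(data, mean)
print(f"estimate={estimate:.2f}, se={se:.3f}")
```

For the mean, the jackknife bias estimate is zero (up to rounding) and the standard error coincides with the familiar s/√n formula, about 0.93 for these five values. The method becomes more informative for statistics such as ratios or correlation coefficients, where no simple formula is available.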

Examples of bootstrap & jackknife

This section presents an example of applying bootstrap and jackknife. Suppose that there are five data points:

5, 4, 8, 9, 7.

Resample the data points with replacement from the original sample to create bootstrap samples. Each bootstrap sample has a size of five, the same as the original sample. Since the data points are selected at random, the bootstrap samples may differ from the original sample and from each other.
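The resampling step can be sketched in a few lines of Python. This is only an illustration: the function name `bootstrap_samples` and the fixed seed are assumptions, and the samples it draws will not match the ones listed in this article.

```python
import random

def bootstrap_samples(data, n_samples, seed=None):
    """Draw n_samples bootstrap samples, each the same size as data."""
    rng = random.Random(seed)
    # random.choices samples WITH replacement, so values can repeat.
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]

original = [5, 4, 8, 9, 7]
samples = bootstrap_samples(original, n_samples=20, seed=0)

# Show the first few samples with their means.
for sample in samples[:3]:
    print(sample, "mean =", sum(sample) / len(sample))
```

Fixing the seed makes the run reproducible; omit it to get a different set of samples on every run.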

The table below shows an example of 20 bootstrap samples, each followed by its mean:

4, 9, 8, 7, 4 = 6.4 | 5, 8, 7, 9, 4 = 6.6 | 9, 9, 7, 4, 5 = 6.8 | 8, 8, 7, 4, 7 = 6.8 | 8, 5, 7, 8, 5 = 6.6
5, 7, 7, 7, 8 = 6.8 | 9, 8, 8, 8, 9 = 8.4 | 4, 9, 5, 7, 4 = 5.8 | 5, 4, 7, 4, 5 = 5 | 8, 4, 5, 9, 4 = 6
7, 4, 9, 8, 4 = 6.4 | 5, 9, 8, 4, 7 = 6.6 | 9, 9, 9, 4, 5 = 7.2 | 5, 5, 4, 4, 4 = 4.4 | 4, 4, 5, 7, 5 = 5
5, 7, 5, 5, 8 = 6 | 9, 7, 4, 8, 5 = 6.6 | 7, 8, 8, 5, 4 = 6.4 | 8, 5, 4, 8, 7 = 6.4 | 5, 4, 5, 5, 5 = 4.8

Table 1: Creating bootstrap samples from the original sample

Results

In this case, the bootstrap is used to calculate a confidence interval for the population mean. First, calculate the mean of each bootstrap sample. Arranged in ascending order, the 20 means are: 4.4, 4.8, 5, 5, 5.8, 6, 6, 6.4, 6.4, 6.4, 6.4, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8, 7.2 and 8.4.

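Next, calculate the confidence interval from the bootstrap sample means. Since the 95% confidence interval is the most common, use the 2.5th and 97.5th percentiles as the endpoints of the interval. This is because the remaining (100% – 95%) = 5% is split equally between the two tails, leaving the middle 95% of all of the bootstrap sample means.

With only 20 bootstrap means these percentiles can only be approximated: linear interpolation between the sorted means gives an interval of roughly 4.6 to 7.8. In other words, there is approximately 95% confidence that the population mean lies within this interval. In practice, many more bootstrap samples (typically 1,000 or more) are drawn so that the percentiles are stable.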
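A 95% percentile interval is conventionally read off the 2.5th and 97.5th percentiles of the bootstrap means. The sketch below applies that rule to the 20 means from Table 1. The helper `percentile` is an illustrative assumption; it uses linear interpolation between order statistics, which is one of several common percentile conventions, and other conventions give slightly different endpoints for such a small set.

```python
def percentile(sorted_values, q):
    """Percentile q (0-100) by linear interpolation between order statistics."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

# Means of the 20 bootstrap samples listed in Table 1.
boot_means = [6.4, 6.6, 6.8, 6.8, 6.6,
              6.8, 8.4, 5.8, 5.0, 6.0,
              6.4, 6.6, 7.2, 4.4, 5.0,
              6.0, 6.6, 6.4, 6.4, 4.8]

ordered = sorted(boot_means)
ci_low = percentile(ordered, 2.5)    # about 4.59
ci_high = percentile(ordered, 97.5)  # about 7.83
print(f"95% percentile interval: {ci_low:.2f} to {ci_high:.2f}")
```

With so few bootstrap samples the endpoints depend heavily on the one or two most extreme means, which is why a much larger number of resamples is normally used.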

Case study using SPSS 

In this article, bootstrapping is performed for two variables, namely height and weight. Bootstrapping can be carried out in SPSS, where it is available for a number of different analyses. Here it is applied to a Pearson correlation analysis. Bootstrap and jackknife are most useful in cases where the data do not follow a normal distribution.
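Outside SPSS, the same idea can be sketched in Python: resample the (height, weight) pairs with replacement, recompute the Pearson correlation each time, and read a percentile interval off the results. The height and weight values below are hypothetical numbers made up for illustration, not the article's dataset, and the function names are assumptions as well.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation_ci(xs, ys, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for r by resampling pairs."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_boot:
        # Resample row indices so each height stays paired with its weight.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # correlation is undefined for a constant resample
        estimates.append(pearson_r(bx, by))
    estimates.sort()
    # Simple order-statistic approximation to the 2.5th and 97.5th percentiles.
    return estimates[n_boot * 25 // 1000], estimates[n_boot * 975 // 1000 - 1]

# Hypothetical data, for illustration only.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 175, 180, 185]
weights_kg = [50, 54, 58, 63, 61, 66, 70, 72, 80, 83]

r_obs = pearson_r(heights_cm, weights_kg)
ci_low, ci_high = bootstrap_correlation_ci(heights_cm, weights_kg)
print(f"r = {r_obs:.3f}, 95% bootstrap interval: {ci_low:.3f} to {ci_high:.3f}")
```

Resampling whole pairs rather than each variable separately is the important design choice here: shuffling the two columns independently would destroy the relationship being measured.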

Analyze > descriptive summary

This case uses the same dataset as the logistic regression article. Table 1 shows the descriptive statistics without bootstrapping: the mean score of four different variables with its standard error, along with skewness and kurtosis values.

Descriptive statistics

Next, bootstrapping is performed for the descriptive analysis on the same data. 1,000 bootstrap samples were used, with a 95% confidence interval.

Figure 1: Descriptive statistics of sample dataset using Bootstrap in SPSS

Analyze > Descriptive analysis > bootstrapping

Figure 2: Results of bootstrapping using SPSS

The figure above shows the results from bootstrapping for interview scores. In this case the standard error of the mean is 1.58 from the bootstrap, compared with 1.67 without it. Bootstrapping does not change the data or remove bias from it; it provides an estimate of the standard error that does not rely on the assumption of normality.
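The comparison between a formula-based and a bootstrap standard error can be reproduced in outline with Python. Since the interview-score dataset is not reproduced in this article, the sketch below uses the five data points from the earlier example as a stand-in, so its numbers will not match the SPSS output. The function name `bootstrap_se_of_mean` is an illustrative assumption.

```python
import math
import random
import statistics

def bootstrap_se_of_mean(data, n_boot=1000, seed=0):
    """Standard deviation of the means of n_boot bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    means = [sum(rng.choices(data, k=n)) / n for _ in range(n_boot)]
    return statistics.stdev(means)

scores = [5, 4, 8, 9, 7]  # stand-in data, not the SPSS dataset

# Conventional standard error of the mean: s / sqrt(n).
analytic_se = statistics.stdev(scores) / math.sqrt(len(scores))
# Bootstrap standard error from 1,000 resamples, as in the SPSS run.
boot_se = bootstrap_se_of_mean(scores, n_boot=1000)

print(f"formula SE = {analytic_se:.3f}, bootstrap SE = {boot_se:.3f}")
```

For the mean of a small sample, the bootstrap value tends to sit slightly below the formula value, because resampling treats the sample as if it were the whole population (a divisor of n rather than n - 1 in the variance). The two estimates converge as the sample size grows.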

Applications of bootstrap & jackknife 

  • Analysing null models, competition and community structure.
  • Detecting density dependence.
  • Characterizing spatial patterns and processes.
  • Estimating population size and vital rates.
  • Building environmental models.
  • Studying evolutionary processes and rates.
  • Conducting phylogeny analysis.
  • Calculating an appropriate sample size for experimental design.
  • Calculating an estimator that is a sample analogue of a parameter.
  • Estimating the bias and standard error of a statistic calculated from a random sample of observations.

Software supporting bootstrap & jackknife  

Several software packages support these methods, including R, SAS, S-PLUS, Resampling Stats, MATLAB, Stata and SPSS.

Prateek Sharma

Analyst at Project Guru
Prateek holds a degree in commerce and has rich experience in the telecom, marketing and banking domains, preparing comprehensive documents and reports while managing internal and external data analysis. He is an adaptable, business-minded Data Analyst at Project Guru, skilled in recording, interpreting and analysing data, with a demonstrated ability to deliver valuable insights via data analytics and advanced data-driven methods. Apart from his strong passion for data science, he finds extreme sports interesting. He keeps himself updated with the latest tech and always loves to learn more about the latest gadgets and technology.