How to perform cross validation on a data set?

One of the important aspects of data mining is checking how well a model fits before using it for prediction. Assessing validity is difficult because no benchmark result is available for the model. A common practice in data science is therefore to build several candidate models, or the same model with different parameter values, and select the one that predicts most accurately. Cross validation is the standard technique for making this comparison: it helps rule out poorly fitting models and select the one that generalizes best. Without it, an apparent improvement in score may simply mean the model has 'overfit' the data. For example, when regressing demand on price, three different models can be fitted to the same data, as shown in the figure below.

Figure 1: Different models for same regression


As the figure shows, the prediction line of the first model fails to capture much of the variation (under-fit): most of the data points lie above the regression line. The second model's line captures almost all of the variation, while the third fits an unrealistic function that chases every single data point (over-fit). From the fitted lines alone it is difficult to tell which of these models is most appropriate. In such cases, the cross validation method helps identify the model with the most generalized relationship.

Testing the data with cross validation

Cross validation is a technique where part of the data is set aside as 'test data' and the model is built on the remaining 'training data'. The model's performance on the training and test data is then compared and the most appropriate model is selected. For instance, suppose there are 20 observations available to build a model, as shown below:

Figure 2: Sample dataset in cross validation


Cross validation allows the researcher to split these data into two or "n" sets and construct different models to cross-validate the results. The data can be split in many ratios, such as 20:80, 50:50 or 30:70, depending on the number, size and format of the observations. Based on this segregation, cross validation can be performed through any number of criteria. To keep the scope of this article short, only the three most common criteria of cross validation are covered.

 

The Validation set approach

This approach segregates the data into two sets of 50% each: half of the data is set aside for validation and the remaining half is used for model training. The model is built on the training half and then applied to the validation half to cross-validate the results. One major disadvantage of this approach is that, with half of the observations excluded from training, there is a high chance of missing important information, which can bias the results.
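The validation set approach can be sketched in Python. The article's actual price-demand observations are not published, so a hypothetical noisy linear demand curve stands in for the 20 observations of Figure 2:

```python
import random

# Hypothetical data: 20 (price, demand) observations with noise,
# standing in for the dataset in Figure 2.
random.seed(0)
data = [(p, 100 - 4 * p + random.uniform(-5, 5)) for p in range(1, 21)]

random.shuffle(data)
train, valid = data[:10], data[10:]  # 50:50 split

def fit_line(points):
    # Ordinary least squares for y = a + b*x (closed form).
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def mse(model, points):
    # Mean squared prediction error of the fitted line on `points`.
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)

model = fit_line(train)      # train on one half ...
print("validation MSE:", mse(model, valid))  # ... validate on the other
```

The validation MSE, not the training MSE, is what different candidate models should be compared on.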

Leave one out cross validation

In this approach, one data point (observation) is left out and the model is built on the remaining observations. This is repeated iteratively so that each data point is left out exactly once. Although this approach minimizes the bias problem of the previous approach, it requires "n" iterations and "n" different models, so execution time is high. Also, if the data contain outliers, the results of successive models can show severe variations.
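A minimal Python sketch of leave-one-out cross validation, again using hypothetical price-demand data in place of the article's unpublished observations:

```python
import random

# Hypothetical (price, demand) observations with noise.
random.seed(1)
data = [(p, 100 - 4 * p + random.uniform(-5, 5)) for p in range(1, 21)]

def fit_line(points):
    # Ordinary least squares for y = a + b*x.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b

# Leave-one-out: n models, each tested on the single held-out point.
errors = []
for i in range(len(data)):
    x, y = data[i]                       # the held-out observation
    a, b = fit_line(data[:i] + data[i + 1:])
    errors.append((y - (a + b * x)) ** 2)

loocv_error = sum(errors) / len(errors)
print("LOOCV error:", loocv_error)
```

Note that 20 observations already require 20 model fits; for large datasets this cost is the main argument for K-fold cross validation instead.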

K Fold cross validation

In this approach, the data is split into "K" folds. In the above example, the 20 observations can be split into five folds such that each fold contains four observations.

Figure 3: Splitting sample dataset into five folds for cross validation

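The split into folds can be written out explicitly. This sketch works on observation indices only, mirroring the five-fold split of 20 observations described above:

```python
# Split the indices of 20 observations into five folds of four
# observations each, as in Figure 3.
indices = list(range(20))
k = 5
fold_size = len(indices) // k
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
for i, fold in enumerate(folds, start=1):
    print("fold", i, ":", fold)
```

In practice the observations would be shuffled before splitting, so that each fold is a random sample rather than a contiguous block.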

Then the model is fitted on K-1 folds. With K=5, the model is built on four folds while the remaining fold is set aside as the test set; in the image below the red-colored data points are the ones set aside.

Figure 4: Iteration 1 for sample dataset


The iterations are then repeated, leaving out a different fold each time.

Figure 5: Iteration 2 for sample dataset

Figure 6: Iteration 3 for sample dataset

Figure 7: Iteration 4 for sample dataset

Figure 8: Iteration 5 for sample dataset


The error of the predictions is recorded on each iteration, and this is repeated until each of the K folds has served as the test set. The average of the K recorded errors is called the cross-validation error and serves as the performance metric for the model: the lower the average, the better the model.
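Putting the steps together, a complete K-fold run can be sketched as follows, again on hypothetical price-demand data standing in for the 20 observations above:

```python
import random

# Hypothetical (price, demand) observations, shuffled so that each
# fold is a random sample of the data.
random.seed(2)
data = [(p, 100 - 4 * p + random.uniform(-5, 5)) for p in range(1, 21)]
random.shuffle(data)

def fit_line(points):
    # Ordinary least squares for y = a + b*x.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b

def mse(model, points):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)

k = 5
fold_size = len(data) // k
fold_errors = []
for i in range(k):
    test = data[i * fold_size:(i + 1) * fold_size]          # held-out fold
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # other K-1 folds
    fold_errors.append(mse(fit_line(train), test))

cv_error = sum(fold_errors) / k  # the cross-validation error
print("per-fold errors:", fold_errors)
print("cross-validation error:", cv_error)
```

To compare candidate models, this loop is run once per model and the model with the lowest cross-validation error is selected.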

Application of cross validation

Cross validation is applied wherever predictive modelling techniques are used. In short, any statistical analysis that aims to forecast values should use cross validation to check the model.

Software packages that support cross validation include R, SAS, MATLAB, STATA and SPSS. In SPSS, however, there is no direct command for cross validation; it has to be performed manually by following the steps of the criteria described above.

Priya Chetty

Partner at Project Guru
Priya Chetty writes frequently about advertising, media, marketing and finance. In addition to posting daily to Project Guru Knowledge Tank, she is currently on the editorial board of the Research & Analysis wing of Project Guru. She focuses on refined content for Project Guru's various paid services. She has also written about what social media means for the media and marketing industries, and has worked in outdoor media agencies such as MPG and hotel marketing companies such as CarePlus.

