How to perform cross validation on a data set?

By Priya Chetty on December 25, 2017

One of the important aspects of data mining is checking how fit a model is for prediction. However, checking the validity of a model is difficult because there is usually no benchmark result to compare it against. To assess a model, a common practice in data science is therefore to iterate over several candidate models and select the most appropriate one; in other words, to test the same model with different parameter values. This is called the cross validation method. It helps to rule out poor models and select the one that gives the most accurate predictions. When testing different models, an improvement in score can be either genuine or a sign that the model ‘overfits’ the data. For example, while building regression models on price and demand, there can be three different models as shown in the figure below.

Figure 1: Different models for same regression

As shown in the figure above, in the first model the prediction line does not capture all the variation (under-fit), because most of the data points lie above the regression line. In the second model the prediction line captures almost all of the variation, while in the third model an unrealistic function captures every possible variation (over-fit). In such cases it is difficult to judge which of the above models is most appropriate. Here, the cross validation method helps to achieve a more generalized relationship.
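The behaviour in Figure 1 can be reproduced in a few lines of code. The sketch below is illustrative only: it uses Python with scikit-learn (not one of the packages listed later in this article, though a common choice), and the synthetic price-demand data and polynomial degrees are assumptions rather than values from the figure. It shows that the fit on the training data always improves as the model becomes more flexible, which is exactly why a separate test is needed to detect over-fitting.

```python
# A minimal sketch of the three models in Figure 1, assuming scikit-learn
# and synthetic price-demand data (both are illustrative assumptions).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
price = np.sort(rng.uniform(1, 10, 20)).reshape(-1, 1)          # 20 observations
demand = 100 - 20 * np.log(price.ravel()) + rng.normal(0, 4, 20)  # noisy trend

# Degree 1 tends to under-fit, degree 3 follows the trend,
# degree 15 chases the noise (over-fit). Training R^2 rises regardless.
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(price, demand)
    print(f"degree {degree:2d}: training R^2 = {model.score(price, demand):.3f}")
```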

Testing the data with cross validation

Cross validation is a technique where a part of the data is set aside as ‘training data’ and the model is built on it, then evaluated on the remaining ‘test data’. The results from the training and test data are then compared and the appropriate model is selected. For instance, suppose there are 20 observations of data to build a model, as shown below:

Figure 2: Sample dataset in cross validation

Cross validation allows the researcher to split these data into two or “n” sets and construct different models to cross validate the results. The data can be split in many ratios, such as 20:80, 50:50 or 30:70, depending on the number, size and format of the observations. Based on this segregation, cross validation can be performed through any number of criteria. To keep the scope of this article short, only the three most common criteria of cross validation are discussed here.

The Validation set approach

This approach segregates the data into two sets of 50%: set aside 50% of the data for validation and use the remaining 50% for model training. One can then fit the model on one half of the data and apply it to the other half to cross validate the result. A major disadvantage of this approach is that, since a large share of the data is withheld from model training, there is a high chance of missing important information, which can lead to bias.
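A minimal sketch of the validation set approach is shown below, assuming scikit-learn; the synthetic X and y arrays stand in for any predictor and response data.

```python
# A minimal sketch of the validation set approach (50:50 split),
# assuming scikit-learn; X and y are illustrative synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, (20, 1))                 # 20 observations, as in Figure 2
y = 100 - 8 * X.ravel() + rng.normal(0, 5, 20)

# Hold out 50% of the data for validation, train on the other 50%.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)        # train on one half
print("training R^2:  ", round(model.score(X_train, y_train), 3))
print("validation R^2:", round(model.score(X_val, y_val), 3))  # validate on the other
```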

Leave one out cross validation

In this approach, one data point (observation) is left out and the model is fitted on the remaining data. This is repeated so that each data point is left out exactly once. Although this approach removes much of the bias of the previous approach, it requires “n” iterations and “n” different models; in short, it takes much longer to execute. Also, if there are outliers in the data, the results of successive models can show severe variations.
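The sketch below shows leave-one-out cross validation with scikit-learn, again on assumed synthetic data. Note that one model is fitted per observation, which illustrates the higher execution time mentioned above.

```python
# A minimal sketch of leave-one-out cross validation, assuming scikit-learn:
# n models are fitted, each leaving out a single observation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, (20, 1))                 # illustrative synthetic data
y = 100 - 8 * X.ravel() + rng.normal(0, 5, 20)

# One fit per observation: 20 iterations for 20 data points.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("number of fitted models:", len(scores))   # n = 20
print("mean squared error:", round(-scores.mean(), 2))
```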

K Fold cross validation

In this approach the data set is split into “K” folds. In the above example, splitting the 20 observations into five folds gives four observations per fold.

Figure 3: Splitting sample dataset into five folds for cross validation
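The split in Figure 3 can be reproduced with scikit-learn’s KFold, as sketched below; only the indices of the observations are shown.

```python
# A minimal sketch of splitting 20 observations into five folds of four,
# mirroring Figure 3 (indices only; KFold is from scikit-learn).
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)   # 20 observations
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X), start=1):
    print(f"iteration {i}: held-out fold = {test_idx}")
# iteration 1: held-out fold = [0 1 2 3]
# ...
# iteration 5: held-out fold = [16 17 18 19]
```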

The model is then fitted on K−1 folds. That means, with K=5, the model is fitted on four folds while one fold is set aside, like the red colored data points in the image below.

Figure 4: Iteration 1 for sample dataset

The iterations are then repeated, leaving out a different fold each time.

Figure 5: Iteration 2 for sample dataset

Figure 6: Iteration 3 for sample dataset

Figure 7: Iteration 4 for sample dataset

Figure 8: Iteration 5 for sample dataset

Furthermore, the error of each prediction is recorded, and the process is repeated until each of the K folds has served as the test set. The average of the K recorded errors is called the cross-validation error and serves as the performance metric for the model: the lower the average value, the better the model.
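The full procedure, from fold-wise errors to the cross-validation error, is sketched below with scikit-learn on assumed synthetic data.

```python
# A minimal sketch of computing the cross-validation error as the average
# of the K per-fold errors, assuming scikit-learn and synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, (20, 1))                 # illustrative synthetic data
y = 100 - 8 * X.ravel() + rng.normal(0, 5, 20)

# One error per fold; scikit-learn reports negated MSE, so flip the sign.
fold_errors = -cross_val_score(LinearRegression(), X, y,
                               cv=KFold(n_splits=5),
                               scoring="neg_mean_squared_error")
print("per-fold errors:", np.round(fold_errors, 2))
print("cross-validation error (mean):", round(fold_errors.mean(), 2))  # lower is better
```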

Application of cross validation

Cross validation is applied in all cases where predictive modeling techniques are used. In short, any statistical analysis that aims to forecast values should use cross validation to check the model.

Software packages that support cross validation include R, SAS, MATLAB, STATA and SPSS. SPSS, however, has no direct command for cross validation; it has to be performed manually, following the steps listed under the criteria above.
