How to use the K-Nearest Neighbor (KNN) algorithm on a dataset?
K-Nearest Neighbor (KNN) is an algorithm that assesses the properties of a new data point with the help of the properties of existing data points. KNN is applicable to both classification and regression problems. It is a simple non-parametric method: it does not build an internal model and does not require the data to follow a particular distribution. For classification, it simply takes a majority vote among the nearest existing data points and labels the new point accordingly.
For example, a company manufactures tissue papers and tests them for acid durability and strength. Depending on the test results, it classifies each tissue as either ‘good’ or ‘bad’. Now, if the company produces a new type of tissue paper, it can use K-Nearest Neighbor to decide the label for the newly produced tissues.
How does K-Nearest Neighbor (KNN) work?
Suppose there is a scatter plot containing points from two classes, ‘a’ and ‘o’. Then a new point, ‘c’, is added to the scatter plot. KNN can be applied to label this new point as one of the existing classes.
To start with KNN, choose a value for ‘K’. Suppose K = 3 in this example. Here, K is the number of nearest neighbours that get a vote, so the three existing points nearest to ‘c’ are encircled.
A vote among these three nearest neighbours shows that there are two ‘a’ points and one ‘o’ point. Since ‘a’ outnumbers ‘o’, the new point ‘c’ is labelled ‘a’.
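The voting procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `knn_classify`, the use of Euclidean distance and the coordinates are all assumptions made for the example, chosen so that the three nearest neighbours of the new point are two ‘a’ points and one ‘o’ point.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled point (an assumption;
    # other distance measures are possible).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points: the "encircled" neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbours and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical coordinates, invented only for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (3.0, 2.5), (6.0, 5.5)]
labels = ["a", "a", "a", "o", "o", "o"]
new_point = (2.5, 2.0)  # the new point 'c'

predicted = knn_classify(points, labels, new_point, k=3)
print(predicted)  # a
```

An odd value of K is a common choice when there are two classes, because it rules out a tied vote.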
Numerical example of KNN in SPSS
This section gives an example of applying the K-Nearest Neighbor algorithm in SPSS. The chosen dataset contains various test scores of 30 students. On the basis of these scores, K-Nearest Neighbor can be used to predict each student’s ‘application status’. In this case the variable ‘status’ has only two values: 1 (hired) and 0 (not hired).
As discussed above, KNN uses the nearest data points to predict the target variable. For example, the value of the point marked in red in the figure below can be predicted.
The result above shows three red lines leading from the point to its three nearest neighbours. The variables used for prediction are plotted on the X-axis (written score), Y-axis (aptitude score) and Z-axis (CGPA).
Similarly, the peer chart shows, for each variable, the values of the nearest neighbours used to predict the new point. In the peer chart, the red points are the nearest neighbours used for the prediction, whereas the blue points are not used. The numbers within the chart identify the respondents. For example, 1 is the data for the first respondent, which the algorithm uses to predict values or groups in the response variable. The peer chart also shows which data is used for training the model and which is held out for validation. There is no holdout data in this dataset, so all the data is used for training the KNN model.
Thus, K-Nearest Neighbor helped classify the applicants into two groups (hired and not hired) based on their CGPA, aptitude scores and written test scores. It allowed the hiring company to take the data collected on each candidate and evaluate it accordingly.
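Outside SPSS, the same kind of classification can be sketched in plain Python. The applicant records below are invented for illustration and are not the 30-student dataset used above. The helper names `min_max_scale` and `predict_status` are hypothetical. The rescaling step is an assumption added here so that written and aptitude scores do not dominate CGPA in the distance calculation; the SPSS walkthrough does not state whether the features were rescaled.

```python
import math
from collections import Counter

# Hypothetical applicants: (written score, aptitude score, CGPA) -> status.
# Invented values for illustration only; 1 = hired, 0 = not hired.
applicants = [
    ((78, 82, 8.5), 1),
    ((85, 90, 9.0), 1),
    ((70, 75, 8.0), 1),
    ((45, 50, 6.0), 0),
    ((55, 40, 6.5), 0),
    ((50, 60, 5.5), 0),
]


def min_max_scale(rows):
    """Rescale each column to [0, 1]; also return a function to scale new rows."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]

    def scale(row):
        return tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )

    return [scale(row) for row in rows], scale


def predict_status(data, candidate, k=3):
    """Predict status by majority vote among the k nearest applicants."""
    features = [row for row, _ in data]
    statuses = [status for _, status in data]
    scaled, scale = min_max_scale(features)
    query = scale(candidate)
    # Rank the existing applicants by Euclidean distance to the new candidate.
    ranked = sorted(zip(scaled, statuses), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(status for _, status in ranked[:k])
    return votes.most_common(1)[0][0]


status = predict_status(applicants, (72, 80, 8.2), k=3)
print("hired" if status == 1 else "not hired")  # hired
```

As in the SPSS example, all records are used for training here and none are held out for validation.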
Applications of K-Nearest Neighbor
Apart from its use as a classification tool as described above, KNN has several further practical applications:
- It is popular in search applications. For example, KNN is well suited to finding the documents most similar to a given document, e.g. for detecting plagiarism.
- It is also applicable in recommender systems, where it searches for items similar to those in demand by other users. Here, K-Nearest Neighbor relies on the idea that items commonly liked by the same two or more people tend to be similar.
- The algorithm also has many uses in data mining and machine learning. One particular use of K Nearest Neighbor is in anomaly detection. Anomaly detection is the identification of items, events, or observations which do not conform to the expected patterns or other items in a dataset.
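The anomaly detection use mentioned in the list can be illustrated with a short sketch. One common approach, assumed here and not taken from the original discussion, is to score each observation by its mean distance to its k nearest neighbours, so that isolated observations receive high scores. The function name `knn_anomaly_scores` and the data are hypothetical.

```python
import math


def knn_anomaly_scores(points, k=2):
    """Score each point by its mean distance to its k nearest other points."""
    scores = []
    for i, point in enumerate(points):
        # Distances to every other point, smallest first.
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical 2-D observations; the last one sits far from the rest.
observations = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_anomaly_scores(observations, k=2)

# The observation with the largest score is the strongest anomaly candidate.
most_anomalous = max(range(len(observations)), key=lambda i: scores[i])
print(observations[most_anomalous])  # (8.0, 8.0)
```

The same ranking by distance, read in the other direction (smallest distance first), is the basis of the similarity search and recommendation uses listed above.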
Software packages that support K-Nearest Neighbor include R, SAS, MATLAB, Stata and SPSS.