How to use K-Nearest Neighbor (KNN) algorithm on a dataset?

By Prateek Sharma & Priya Chetty on July 16, 2018

K- Nearest Neighbor, popular as K-Nearest Neighbor (KNN), is an algorithm that helps to assess the properties of a new variable with the help of the properties of existing variables. KNN is applicable in classification as well as regression predictive problems. KNN is a simple non-parametric test. It does not involve any internal modelling and does not require data points to have certain properties. It simply takes the voting of the majority of variables and accordingly treats new variables.

For example, a company manufactures tissue papers and tests them for acid durability and strength. Depending upon the test results, it classifies the new paper tissues as either ‘good’ or ‘bad’. Now, if the company produces a type of tissue paper it can use K-Nearest Neighbor to decide the labels for newly produced tissues.

How K-Nearest Neighbor (KNN) works?

Suppose there is a scatter plot of two variables, ‘a’ and ‘o’. Then a third variable, ‘c’ is introduced to the scatter plot. Now to label this variable as existing ones, KNN can be applied.

Figure 1: Scatterplot of variables for K-Nearest Neighbor (KNN) example — Figure 1: Scatter plot of variables for K-Nearest Neighbor **(KNN)** example

To start with KNN, consider a hypothesis of the value of ‘K’. Suppose K = 3 in this example. Here, K is the nearest neighbour and wishes to take a vote from three existing variables. Therefore, K Nearest Neighbor will be used. The variable ‘c’ will be encircled taking three more existing variables which are nearest.

Figure 2 : Scatterplot of variables in KNN example — Figure 2 : Scatterplot of variables in **KNN** example

For instance, using KNN with K = 3, the three nearest neighbours of the new variable ‘c’ were encircled. Then, a vote from the three selected nearest neighbours shows that there are two ‘a’ and one ‘o’ variables. Since variable ‘a’ is more in number than variable ‘o’, the new variable ‘c’ must be labelled as ‘a’.

Numerical example of KNN in SPSS

This section gives an example to show the application of the K-Nearest Neighbor algorithm in SPSS. The chosen dataset contains various test scores of 30 students. So, on the basis of these scores, the K Nearest Neighbor test can be used to find the nearest neighbour for ‘application status’. In this case the variable ‘status’ has only two values; 1- hired and 0- not hired.

Table 1 Data set used for KNN test in SPSS — Table 1 Data set used for **KNN** test in SPSS

As discussed above, the KNN test uses the nearest value to predict the target variable. For example, the value of the point shown in the figure below can be predicted (marked in red).

Table 2: Using KNN test find the nearest values for prediction — Figure 3: The predicted value of the point (in red)

Figure 3: Nearest values from the point (in red) — Figure 4: Nearest values from the point (in red)

The result above shows three red lines leading to the three nearest values from the point. The X-axis (written score), Y-axis (aptitude score) and Z-axis (CGPA) are the variables for prediction.

Figure 5: Peer chart for KNN test in SPSS — Figure 5: Peer chart for **KNN** test in SPSS

Similarly, the peer chart shows which value is used from which variable to predict the new variable based on the nearest value. In the peer chart, the values in red are the nearest values for predicting the new variable whereas the values of the blue dot are idle. The numbering within the chart represents the respondent. For example, 1 is the data for the first respondent, which the algorithm uses to predict values or groups in the response variable. The peer chart also shows the data which is to be used for training the model and left for validation. So far there is no Holdout data in this dataset and all the data is used for training the KNN model.

Thus, K Nearest Neighbor helped in classifying the applicants into two groups (i.e. hired, not hired) based on their acquired CGPA, aptitude and written tests. It helped the hiring company to easily collect the data containing the candidate’s information and evaluate it accordingly.

Offer ID is invalid

Applications of K-Nearest neighbor

Apart from using a classification tool like described above, KNN has several further applications in the practical world:

It is popular in search applications. For example, if one wants to find the most similar documents to a certain document, i.e. for detecting plagiarism, KNN algorithm is ideal.
It is also applicable in recommender systems in order to search for items which are similar to those in demand by other users. Here, K Nearest Neighbor will help deduce that items liked commonly by two or more people tend to be similar.
The algorithm also has many uses in data mining and machine learning. One particular use of K Nearest Neighbor is in anomaly detection. Anomaly detection is the identification of items, events, or observations which do not conform to the expected patterns or other items in a dataset.

Software that supports K-Nearest Neighbor best are R, SAS, MATLAB, STATA and SPSS.

Priya Chetty

I am a management graduate with specialisation in Marketing and Finance. I have over 12 years' experience in research and analysis. This includes fundamental and applied research in the domains of management and social sciences. I am well versed with academic research principles. Over the years i have developed a mastery in different types of data analysis on different applications like SPSS, Amos, and NVIVO. My expertise lies in inferring the findings and creating actionable strategies based on them.

Over the past decade I have also built a profile as a researcher on Project Guru's Knowledge Tank division. I have penned over 200 articles that have earned me 400+ citations so far. My Google Scholar profile can be accessed here.

I now consult university faculty through Faculty Development Programs (FDPs) on the latest developments in the field of research. I also guide individual researchers on how they can commercialise their inventions or research findings. Other developments im actively involved in at Project Guru include strengthening the "Publish" division as a bridge between industry and academia by bringing together experienced research persons, learners, and practitioners to collaboratively work on a common goal.

How K-Nearest Neighbor (KNN) works?

Numerical example of KNN in SPSS

Applications of K-Nearest neighbor

Discuss

proofreading