How to perform cluster analysis?

While many statistical methods in machine learning are used either to predict or analyse trends in the data, cluster analysis is used for organizing the data. It is a process of grouping observations of similar kinds within a large population. Therefore, it tries to identify homogenous groups of cases. Furthermore, cluster analysis is an exploratory analysis technique that tries to identify the structures in data. Because it is explorative, it does distinguish between dependent and independent variables. It is also largely used as a sequence of analysis. For instance, in case of factor analysis or discriminant analysis, it helps identify groups and profiles the clusters.

How cluster analysis works?

This section presents a case study to explain the application of cluster analysis on a dataset. The sample dataset has 10 different ice cream stores in city. They sell two different flavors of ice creams. Below table presents the sale of each flavor of ice cream in each store of the city.

Store number Vanilla chocolate
S1 12 6
S2 15 16
S3 18 17
S4 10 8
S5 8 7
S6 9 6
S7 18 17
S8 12 9
S9 20 18
S10 21 15

Table 1: Dataset of ice-cream sales

Here, the unit or time frame does not matter. To have an idea about the sales, plot the data in a scatterplot, based on different stores, as shown in graph below. In the below graph, the dots represents different stores in a city. While the X axis represents the sale of vanilla ice cream, the Y axis represents the sale of chocolate ice cream. It shows a clear trend for both flavours; some stores demonstrate similar sales patterns. Therefore, the eight stores can be divided into two distinctive groups depending upon their sales. This classification is called cluster analysis.

Scatterplot showing sample dataset of ice-cream sales for cluster analysis

Scatterplot showing sample dataset of ice-cream sales

When cluster analysis is applied, the eight stores can be divided in two different groups. In Group 1, the stores sell between 5 and 15 units of both vanilla and chocolate, while Group 2 stores sell between 15 and 20 units of both vanilla and chocolate ice-cream.

Application of cluster analysis

Cluster analysis can be applied in different fields, both as a part of sequence of analysis as well as an independent test. For example:

  • Healthcare sector can use cluster analysis for diagnostic clusters. It can help identify groups of patients with similar symptoms and also maximize the difference between the groups.
  • Marketing involves maximum use of cluster analysis for customer segmentation. For instance, it helps segregate customers as per their needs, attitudes, attributes, demographics, and behavior.
  • Education sector can as well involve the use of this method. It can help identify what homogeneous groups exist among students (for example, high achievers in all subjects, or students that excel in certain subjects but fail in others, etc.).
  • Biology researches can also make the use of cluster analysis with different sets of data on plants and their phenotypes. For instance, it can group observations into series of clusters and build a taxonomy tree of groups and subgroups of similar plants.

Software that support this method include R, SAS, MATLAB, STATA and SPSS. Cluster analysis can also be performed on qualitative data using compatible software like NVivo.

Priya Chetty

Partner at Project Guru
Priya Chetty writes frequently about advertising, media, marketing and finance. In addition to posting daily to Project Guru Knowledge Tank, she is currently in the editorial board of Research & Analysis wing of Project Guru. She emphasizes more on refined content for Project Guru's various paid services. She has also reviewed about various insights of the social insider by writing articles about what social media means for the media and marketing industries. She has also worked in outdoor media agencies like MPG and hotel marketing companies like CarePlus.

Latest posts by Priya Chetty (see all)

Related articles

  • How to apply logistic regression in a case? Machine learning involves solutions to predict scenarios based on past data. Logistic regression offers probability functions based on inputs and their corresponding output.
  • How to perform and apply Monte Carlo simulation? Monte Carlo simulation is an extension of statistical analysis where simulated data is produced. This method uses repeated sampling techniques to generate simulated data.
  • How to conduct path analysis? Path analysis is a graphical representation of multiple regression models. In this analysis, the graphs represent the relationship between dependent and independent variables with the help of square and arrows.
  • How to conduct survival analysis? Survival analysis is a method under predictive modeling where the dependent variable is time. Therefore, it involves time-to-event prediction modeling. The methodology is that our outcome variable is time until the occurrence of a certain event.
  • How to perform nonlinear regression? Regression analysis is a statistical tool to study the relationship between variables. These variables are the outcome variable and one or more exposure variables. In other words, regression analysis is an equation which predicts a response from the value of a certain predictor.

Discuss

We are looking for candidates who have completed their master's degree or Ph.D. Click here to know more about our vacancies.