How to perform cluster analysis?

By Priya Chetty on December 5, 2017

While many statistical methods in machine learning are used to predict or analyse trends in data, cluster analysis is used to organize the data. It is the process of grouping similar observations within a large population. In other words, it tries to identify homogeneous groups of cases. Cluster analysis is also an exploratory technique that tries to identify structures in the data. Because it is exploratory, it does not distinguish between dependent and independent variables. It is often used as one step in a sequence of analyses. For instance, alongside factor analysis or discriminant analysis, it helps identify groups and profile the resulting clusters.

How does cluster analysis work?

This section presents a case study to explain the application of cluster analysis to a dataset. The sample dataset covers 10 ice cream stores in a city. Each store sells two flavors of ice cream. The table below presents the sales of each flavor in each store.

Store number   Vanilla   Chocolate
S1             12        6
S2             15        16
S3             18        17
S4             10        8
S5             8         7
S6             9         6
S7             18        17
S8             12        9
S9             20        18
S10            21        15

Table 1: Dataset of ice-cream sales

Here, the unit or time frame does not matter. To get an idea of the sales, plot the data in a scatterplot, as shown in the graph below. Each dot represents one store in the city. The X axis represents the sales of vanilla ice cream and the Y axis represents the sales of chocolate ice cream. The plot shows a clear pattern for both flavours: some stores have similar sales. Therefore, the ten stores can be divided into two distinct groups depending on their sales. This kind of grouping is called cluster analysis.

Scatterplot showing sample dataset of ice-cream sales
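
The idea that some stores have similar sales can be made concrete by measuring how far apart two stores are on the scatterplot. The sketch below is only an illustration: it assumes Euclidean distance as the similarity measure (the article does not name one), and the helper name nearest_store is hypothetical. It uses the figures from Table 1 and reports, for each store, the store with the most similar sales.

```python
from math import dist

# Sales from Table 1: store -> (vanilla, chocolate)
sales = {
    "S1": (12, 6), "S2": (15, 16), "S3": (18, 17), "S4": (10, 8),
    "S5": (8, 7), "S6": (9, 6), "S7": (18, 17), "S8": (12, 9),
    "S9": (20, 18), "S10": (21, 15),
}


def nearest_store(store):
    """Return the other store whose sales are closest.

    Assumption: closeness is the Euclidean distance between the
    (vanilla, chocolate) points; ties go to the store listed first.
    """
    others = (s for s in sales if s != store)
    return min(others, key=lambda s: dist(sales[store], sales[s]))


for store in sales:
    other = nearest_store(store)
    gap = dist(sales[store], sales[other])
    print(f"{store}: most similar store is {other} (distance {gap:.2f})")
```

Stores that sit close together on the plot, such as S3 and S7 with identical sales, end up with small distances, which is the intuition behind grouping them into one cluster.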

When cluster analysis is applied, the ten stores are divided into two groups. In Group 1, the stores sell between 5 and 15 units of both vanilla and chocolate, while Group 2 stores sell between 15 and 21 units of both flavours.
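
The article does not say which clustering algorithm produces these two groups. As one possible illustration, the sketch below runs a plain k-means procedure on the data from Table 1. The choice of k-means, the function name kmeans, and the use of the first two stores as starting centroids are all assumptions made for this example, not the author's method.

```python
from math import dist

# Sales from Table 1: store -> (vanilla, chocolate)
sales = {
    "S1": (12, 6), "S2": (15, 16), "S3": (18, 17), "S4": (10, 8),
    "S5": (8, 7), "S6": (9, 6), "S7": (18, 17), "S8": (12, 9),
    "S9": (20, 18), "S10": (21, 15),
}


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: repeat assignment and update until centroids stop moving."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


stores = list(sales)
points = [sales[s] for s in stores]
# Assumption: start from the first two stores; other starts may behave differently.
labels, centroids = kmeans(points, [points[0], points[1]])

groups = {}
for store, label in zip(stores, labels):
    groups.setdefault(label, []).append(store)

for label, members in sorted(groups.items()):
    print(f"Group {label + 1}: {members}, centroid {centroids[label]}")
```

With this starting point, the procedure separates S1, S4, S5, S6 and S8 from S2, S3, S7, S9 and S10, which matches the low-sales and high-sales groups described above. On less clearly separated data, k-means results can depend on the starting centroids.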

Application of cluster analysis

Cluster analysis can be applied in different fields, both as part of a sequence of analyses and as an independent test. For example:

  • The healthcare sector can use cluster analysis to form diagnostic clusters. It can help identify groups of patients with similar symptoms while maximizing the difference between the groups.
  • Marketing makes heavy use of cluster analysis for customer segmentation. For instance, it helps segment customers by their needs, attitudes, attributes, demographics, and behavior.
  • The education sector can also use this method. It can help identify which homogeneous groups exist among students (for example, high achievers in all subjects, or students who excel in certain subjects but fail in others).
  • Biological research can also make use of cluster analysis with different sets of data on plants and their phenotypes. For instance, it can group observations into a series of clusters and build a taxonomy tree of groups and subgroups of similar plants.

Software packages that support this method include R, SAS, MATLAB, STATA and SPSS. Cluster analysis can also be performed on qualitative data using compatible software such as NVivo.

Priya is the co-founder and Managing Partner of Project Guru, a research and analytics firm based in Gurgaon. She is responsible for the human resource planning and operations functions. Her expertise in analytics has been used in a number of service-based industries like education and financial services.

Her foundational education is from St. Xaviers High School (Mumbai). She also holds an MBA degree in Marketing and Finance from the Indian Institute of Planning and Management, Delhi (2008).

Some of the notable projects she has worked on include:

  • Using systems thinking to improve sustainability in operations: A study carried out in Malaysia in partnership with Universiti Kuala Lumpur.
  • Assessing customer satisfaction with in-house doctors of Jiva Ayurveda (a project executed for the company)
  • Predicting the potential impact of green hydrogen microgrids (a project executed for the Government of South Africa)

She is a key contributor to the in-house research platform Knowledge Tank.

She currently holds over 300 citations from her contributions to the platform.

She has also been a guest speaker at various institutes such as JIMS (Delhi), BPIT (Delhi), and SVU (Tirupati).