How to perform cluster analysis?

By Priya Chetty on December 5, 2017

While many statistical methods in machine learning are used either to predict or analyse trends in the data, cluster analysis is used for organizing the data. It is a process of grouping observations of similar kinds within a large population. Therefore, it tries to identify homogenous groups of cases. Furthermore, cluster analysis is an exploratory analysis technique that tries to identify the structures in data. Because it is explorative, it does distinguish between dependent and independent variables. It is also largely used as a sequence of analysis. For instance, in case of factor analysis or discriminant analysis, it helps identify groups and profiles the clusters.

How cluster analysis works?

This section presents a case study to explain the application of cluster analysis on a dataset. The sample dataset has 10 different ice cream stores in city. They sell two different flavors of ice creams. Below table presents the sale of each flavor of ice cream in each store of the city.

Store number Vanilla chocolate
S1 12 6
S2 15 16
S3 18 17
S4 10 8
S5 8 7
S6 9 6
S7 18 17
S8 12 9
S9 20 18
S10 21 15

Table 1: Dataset of ice-cream sales

Here, the unit or time frame does not matter. To have an idea about the sales, plot the data in a scatterplot, based on different stores, as shown in graph below. In the below graph, the dots represents different stores in a city. While the X axis represents the sale of vanilla ice cream, the Y axis represents the sale of chocolate ice cream. It shows a clear trend for both flavours; some stores demonstrate similar sales patterns. Therefore, the eight stores can be divided into two distinctive groups depending upon their sales. This classification is called cluster analysis.

Scatterplot showing sample dataset of ice-cream sales for cluster analysis
Scatterplot showing sample dataset of ice-cream sales

When cluster analysis is applied, the eight stores can be divided in two different groups. In Group 1, the stores sell between 5 and 15 units of both vanilla and chocolate, while Group 2 stores sell between 15 and 20 units of both vanilla and chocolate ice-cream.

Application of cluster analysis

Cluster analysis can be applied in different fields, both as a part of sequence of analysis as well as an independent test. For example:

  • Healthcare sector can use cluster analysis for diagnostic clusters. It can help identify groups of patients with similar symptoms and also maximize the difference between the groups.
  • Marketing involves maximum use of cluster analysis for customer segmentation. For instance, it helps segregate customers as per their needs, attitudes, attributes, demographics, and behavior.
  • Education sector can as well involve the use of this method. It can help identify what homogeneous groups exist among students (for example, high achievers in all subjects, or students that excel in certain subjects but fail in others, etc.).
  • Biology researches can also make the use of cluster analysis with different sets of data on plants and their phenotypes. For instance, it can group observations into series of clusters and build a taxonomy tree of groups and subgroups of similar plants.

Software that support this method include R, SAS, MATLAB, STATA and SPSS. Cluster analysis can also be performed on qualitative data using compatible software like NVivo.

Discuss