# How to perform cluster analysis?

While many statistical methods in machine learning are used either to predict or analyse trends in the data, cluster analysis is used for organizing the data. It is a process of grouping observations of similar kinds within a large population. Therefore, it tries to identify homogenous groups of cases. Furthermore, cluster analysis is an exploratory analysis technique that tries to identify the structures in data. Because it is explorative, it does distinguish between dependent and independent variables. It is also largely used as a sequence of analysis. For instance, in case of factor analysis or discriminant analysis, it helps identify groups and profiles the clusters.

## How cluster analysis works?

This section presents a case study to explain the application of cluster analysis on a dataset. The sample dataset has 10 different ice cream stores in city. They sell two different flavors of ice creams. Below table presents the sale of each flavor of ice cream in each store of the city.

 Store number Vanilla chocolate S1 12 6 S2 15 16 S3 18 17 S4 10 8 S5 8 7 S6 9 6 S7 18 17 S8 12 9 S9 20 18 S10 21 15

Table 1: Dataset of ice-cream sales

Here, the unit or time frame does not matter. To have an idea about the sales, plot the data in a scatterplot, based on different stores, as shown in graph below. In the below graph, the dots represents different stores in a city. While the X axis represents the sale of vanilla ice cream, the Y axis represents the sale of chocolate ice cream. It shows a clear trend for both flavours; some stores demonstrate similar sales patterns. Therefore, the eight stores can be divided into two distinctive groups depending upon their sales. This classification is called cluster analysis.

Scatterplot showing sample dataset of ice-cream sales

When cluster analysis is applied, the eight stores can be divided in two different groups. In Group 1, the stores sell between 5 and 15 units of both vanilla and chocolate, while Group 2 stores sell between 15 and 20 units of both vanilla and chocolate ice-cream.

## Application of cluster analysis

Cluster analysis can be applied in different fields, both as a part of sequence of analysis as well as an independent test. For example:

• Healthcare sector can use cluster analysis for diagnostic clusters. It can help identify groups of patients with similar symptoms and also maximize the difference between the groups.
• Marketing involves maximum use of cluster analysis for customer segmentation. For instance, it helps segregate customers as per their needs, attitudes, attributes, demographics, and behavior.
• Education sector can as well involve the use of this method. It can help identify what homogeneous groups exist among students (for example, high achievers in all subjects, or students that excel in certain subjects but fail in others, etc.).
• Biology researches can also make the use of cluster analysis with different sets of data on plants and their phenotypes. For instance, it can group observations into series of clusters and build a taxonomy tree of groups and subgroups of similar plants.

Software that support this method include R, SAS, MATLAB, STATA and SPSS. Cluster analysis can also be performed on qualitative data using compatible software like NVivo.

### Priya Chetty

Partner at Project Guru
Priya is a master in business administration with majors in marketing and finance. She is fluent with data modelling, time series analysis, various regression models, forecasting and interpretation of the data. She has assisted data scientists, corporates, scholars in the field of finance, banking, economics and marketing.

### Related articles

• How to perform bootstrap and jackknife analysis? Bootstrap and jackknife are superficially similar statistical techniques that involve re-sampling the data. They are nonparametric and specific resampling techniques that can estimate standard errors and confidence intervals of a population parameter.
• How to apply logistic regression in a case? Machine learning involves solutions to predict scenarios based on past data. Logistic regression offers probability functions based on inputs and their corresponding output.
• How to apply missing data imputation? Missing data is one of the most common problems in almost all statistical analyses. If the data is not available for all the observations of variables in the model, then it is a case of ‘missing data’. Missing data are part of almost all researches. They are also a common problem in most […]
• How to perform and apply Monte Carlo simulation? Monte Carlo simulation is an extension of statistical analysis where simulated data is produced. This method uses repeated sampling techniques to generate simulated data.
• How to apply linear discriminant analysis? Linear discriminant model is a multivariate model. It is used for modeling the differences in groups. In this model a categorical variable can be predicted through continuous or binary dependent variable.

We are looking for candidates who have completed their master's degree or Ph.D. Click here to know more about our vacancies.