Performing Canonical Correlation Analysis (CCA)

By Priya Chetty on January 10, 2018

Until recently, Karl Pearson correlation analysis was one of the most popular methods for measuring the linear association between variables in a data set. For example, it can establish the correlation between a variable X and a variable Y, where both variables belong to a single data set. Canonical Correlation Analysis (CCA), on the other hand, measures the correlation between sets of variables that belong to different data sets.

For example, the figure below shows two data sets:

Figure 1: Graph showing canonical correlation from two different data sets

Here, Canonical Correlation Analysis helps to estimate the association between the variables of one data set (such as age, sex and diet) and the variables of the other data set (such as heartbeat rate, haemoglobin and blood pressure). CCA is a well-known multivariate analysis method for quantifying the correlation between two sets of multidimensional variables.

How does canonical correlation analysis work?

As discussed above, CCA works with two different data sets. However, instead of correlating each variable with every other variable, it correlates linear combinations of the variables in the two data sets.

Figure 2: Procedure of canonical correlation analysis taking the linear combination from data set X and Y

For instance, suppose there are two data sets, X and Y. CCA forms a linear combination of X’s variables and a linear combination of Y’s variables, each with its own weights “bi”. It then computes the correlation between the two resulting linear combinations, “UX” and “TY”, choosing the weights so that this correlation is as large as possible.
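The procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any particular statistics package: it assumes NumPy is available, uses synthetic data, and computes the weights through a QR decomposition followed by a singular value decomposition, which is one common way to obtain canonical correlations. The function name `canonical_correlations` and the variable names `u` and `t` are made up for this example.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Return canonical correlations and weights for X (n x p) and Y (n x q).

    Assumes more rows than columns and no linearly dependent columns
    within either data set.
    """
    # Centre every column so that correlations reduce to inner products.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # QR gives an orthonormal basis (Q) for each centred data set.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)

    # The singular values of Qx^T Qy are the canonical correlations,
    # sorted from largest to smallest.
    U, corrs, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)

    # Convert back to weights on the original (centred) variables.
    x_weights = np.linalg.solve(Rx, U)
    y_weights = np.linalg.solve(Ry, Vt.T)
    return corrs, x_weights, y_weights


# Synthetic data (illustration only): the first column of each data set
# shares a common signal, the remaining columns are pure noise.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n),
                     rng.normal(size=n)])

corrs, x_weights, y_weights = canonical_correlations(X, Y)

# First pair of linear combinations (canonical variates).
u = (X - X.mean(axis=0)) @ x_weights[:, 0]
t = (Y - Y.mean(axis=0)) @ y_weights[:, 0]

print("canonical correlations:", corrs)
print("correlation of first pair:", np.corrcoef(u, t)[0, 1])
```

The number of canonical correlations equals the number of variables in the smaller data set (two here), and the ordinary Pearson correlation between `u` and `t` reproduces the first canonical correlation.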

Important assumptions for canonical correlation

  1. A key assumption of Canonical Correlation Analysis is that the variables follow a multivariate normal (Gaussian) distribution in the population from which the sample was taken.
  2. Like multivariate regression, Canonical Correlation Analysis requires a large sample size to generate a robust model.
  3. Canonical Correlation Analysis cannot be performed if multicollinearity is found within one or both variable sets. In other words, no two variables in a set should have a correlation equal to 1 with each other.
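The multicollinearity assumption can be checked before running CCA. The sketch below is one possible check, assuming NumPy and synthetic data; the function name `check_variable_set` and the 0.999 threshold are arbitrary choices for illustration. It reports the largest absolute pairwise correlation within a variable set and also tests whether any variable is an exact linear combination of the others, a case that pairwise correlations alone can miss.

```python
import numpy as np


def check_variable_set(data, threshold=0.999):
    """Screen one variable set (rows = cases, columns = variables)."""
    corr = np.corrcoef(data, rowvar=False)
    # Keep only the off-diagonal entries (the diagonal is always 1).
    off_diag = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
    centred = data - data.mean(axis=0)
    return {
        "max_abs_correlation": float(off_diag.max()),
        "perfect_pair": bool(off_diag.max() >= threshold),
        # Rank below the column count means a variable is a linear
        # combination of the others.
        "rank_deficient": bool(np.linalg.matrix_rank(centred) < data.shape[1]),
    }


rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = rng.normal(size=100)

ok_set = np.column_stack([a, b, c])             # independent variables
pair_set = np.column_stack([a, 2 * a + 1, b])   # second column duplicates the first
combo_set = np.column_stack([a, b, a + b])      # third column is a sum of the others

for name, data in [("ok", ok_set), ("pair", pair_set), ("combo", combo_set)]:
    print(name, check_variable_set(data))
```

In this toy example only the first set passes both checks; the third set has no pair with a correlation of 1 but is still rank deficient.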

Example for canonical correlation

This example examines the strength of association between five aptitude tests and three tests on maths, reading and writing. It uses two data sets: one with the standard tests (Maths, Reading and Writing) and the other with the aptitude tests (Apt1, Apt2, Apt3, Apt4 and Apt5).

Table 1: The first table reports the canonical correlation coefficients and the eigenvalues of the canonical roots. The first canonical correlation coefficient is 0.65723, with an eigenvalue of 0.76042, and it accounts for 74.26% of the explained variance. This supports the hypothesis that the three test scores and the five aptitude test scores are positively related.
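The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the usual convention in this kind of output, where the eigenvalue of a canonical root equals r² / (1 − r²) and the percentage is that eigenvalue's share of the sum of all eigenvalues. Both conventions are assumptions here, since the original output is not reproduced in the text.

```python
# Values reported for the first canonical root.
r1 = 0.65723                 # first canonical correlation
reported_eigenvalue = 0.76042
reported_share = 0.7426      # 74.26%

# Assumed relationship: eigenvalue = r^2 / (1 - r^2).
eigenvalue = r1 ** 2 / (1 - r1 ** 2)
print(f"eigenvalue from r1: {eigenvalue:.4f}")   # approximately 0.7604

# If the percentage is the eigenvalue's share of the total, the three
# eigenvalues together would sum to roughly this value.
implied_total = reported_eigenvalue / reported_share
print(f"implied sum of eigenvalues: {implied_total:.3f}")
```

The computed eigenvalue agrees with the reported 0.76042 up to rounding of the correlation coefficient, which suggests the table follows this convention.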

Table 2: The second table shows the significance test results for the three canonical roots at the significance level p < 0.05 (Roots 1 To 3, 2 To 3, and 3 To 3). ‘Roots 1 To 3’ includes all the canonical roots, ‘Roots 2 To 3’ considers only the last two canonical roots, and ‘Roots 3 To 3’ includes just the last canonical root. In this example, none of the roots is significant at p < 0.05.
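To illustrate how such sequential tests are built, the sketch below computes Wilks' lambda for each group of roots together with Bartlett's chi-square approximation. This is a simplified stand-in: statistical packages often report an F approximation instead, and the canonical correlations and sample size used here are hypothetical values chosen for illustration, not the values from the example above.

```python
import math


def sequential_wilks(canonical_corrs, n, p, q):
    """Wilks' lambda and Bartlett's chi-square for roots k..m, k = 1..m.

    n is the sample size; p and q are the numbers of variables in the
    two data sets.
    """
    m = len(canonical_corrs)
    results = []
    for k in range(m):
        # Lambda for the roots that remain after dropping the first k.
        wilks = math.prod(1 - r ** 2 for r in canonical_corrs[k:])
        chi2 = -(n - 1 - (p + q + 1) / 2) * math.log(wilks)
        df = (p - k) * (q - k)
        results.append((f"Roots {k + 1} To {m}", wilks, chi2, df))
    return results


# Hypothetical inputs: three canonical correlations, 100 cases,
# five variables in one set and three in the other.
hypothetical_corrs = [0.6, 0.3, 0.1]
results = sequential_wilks(hypothetical_corrs, n=100, p=5, q=3)

for label, wilks, chi2, df in results:
    print(f"{label}: Wilks' lambda = {wilks:.4f}, chi-square = {chi2:.2f}, df = {df}")
```

Each chi-square value would then be compared with the chi-square distribution for its degrees of freedom to obtain a p-value; a group of roots is judged significant when that p-value falls below 0.05.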

Figure 4: Results from canonical correlation

Application of canonical correlation analysis

CCA is applicable wherever multiple data sets are available. For example:

  • A credit card company can apply CCA to find the association between bank account types (Current, Savings, or Fixed Deposit) and the credit cards taken.
  • A healthcare research centre can apply CCA to test the association between predictors of a disease, based on the medical history of patients.
  • Insurance companies use CCA to test the association between the types of insurance policy taken (such as life insurance and health insurance) and the characteristics of individuals (such as income, age, gender and medical background).
  • Marketers use CCA to examine the association between customers’ demographic factors and their preferences for different products.

Software packages that support CCA with multiple independent variables include R, SAS, MATLAB, STATA and SPSS. However, SPSS does not include a separate command for CCA; instead, the analysis is carried out using Syntax.

Priya is the co-founder and Managing Partner of Project Guru, a research and analytics firm based in Gurgaon. She is responsible for the human resource planning and operations functions. Her expertise in analytics has been used in a number of service-based industries like education and financial services.

Her foundational education is from St. Xaviers High School (Mumbai). She also holds an MBA degree in Marketing and Finance from the Indian Institute of Planning and Management, Delhi (2008).

Some of the notable projects she has worked on include:

  • Using systems thinking to improve sustainability in operations: A study carried out in Malaysia in partnership with Universiti Kuala Lumpur.
  • Assessing customer satisfaction with in-house doctors of Jiva Ayurveda (a project executed for the company)
  • Predicting the potential impact of green hydrogen microgrids (a project executed for the Government of South Africa)

She is a key contributor to the in-house research platform Knowledge Tank.

She currently holds over 300 citations from her contributions to the platform.

She has also been a guest speaker at various institutes such as JIMS (Delhi), BPIT (Delhi), and SVU (Tirupati).

 
