Karl Pearson correlation analysis has long been one of the most popular methods for measuring the linear association between two variables in a data set, for example between an X variable and a Y variable that both belong to the same data set. Canonical Correlation Analysis (CCA), on the other hand, measures the correlation between sets of variables that come from different data sets.
For example, the figure below shows two datasets:
Here, Canonical Correlation Analysis helps estimate the association between the variables of one dataset (such as age, sex and diet) and the variables of the other dataset (such as heartbeat rate, hemoglobin and blood pressure). CCA is a well-known multivariate analysis method for quantifying the correlation between two sets of multidimensional variables.
How does Canonical Correlation Analysis work?
As discussed above, CCA works with two different data sets. However, instead of correlating each variable with every other variable, it correlates linear combinations of the variables in the two data sets.
For instance, suppose there are two data sets, X and Y. CCA forms a linear combination of X’s variables and a linear combination of Y’s variables, each using its own set of weights “bi”. It then computes the correlation between the two linear combinations, denoted “UX” and “TY”. The weights are chosen so that this correlation is as large as possible.
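The idea of correlating weighted combinations can be sketched in Python with NumPy. This is a minimal illustration, not the procedure used by any particular statistics package: the function name `canonical_correlations`, the QR-plus-SVD approach and the simulated data are all assumptions made for the example, and it assumes each data set has full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Return canonical correlations and weight matrices for X (n x p) and Y (n x q)."""
    # Centre each variable so correlations do not depend on the means.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two data sets.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx' Qy are the canonical correlations.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    # Map back to weights on the original (centred) variables.
    A = np.linalg.solve(Rx, U)     # weights for X's variables
    B = np.linalg.solve(Ry, Vt.T)  # weights for Y's variables
    return np.clip(s, 0.0, 1.0), A, B

# Simulated data (hypothetical): both sets share one latent factor z.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + rng.normal(size=n) for _ in range(3)])
Y = np.column_stack([z + rng.normal(size=n) for _ in range(2)])

r, A, B = canonical_correlations(X, Y)
u = (X - X.mean(axis=0)) @ A[:, 0]  # first linear combination of X
t = (Y - Y.mean(axis=0)) @ B[:, 0]  # first linear combination of Y
print("canonical correlations:", r)
print("correlation of first pair of combinations:", np.corrcoef(u, t)[0, 1])
```

The number of canonical correlations equals the smaller of the two variable counts, and the ordinary Pearson correlation between the first pair of combinations reproduces the first canonical correlation.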
Important Assumptions for Canonical Correlation
- A key assumption of Canonical Correlation Analysis is that the variables follow a multivariate normal (Gaussian) distribution in the population from which the sample was taken.
- Like multivariate regression, Canonical Correlation Analysis requires a large sample size to generate a robust model.
- Canonical Correlation Analysis cannot be performed if there is multicollinearity within either variable set. In other words, no two variables in a set should have a correlation equal to 1 with each other.
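The multicollinearity assumption can be screened before running CCA. The sketch below is one possible check, assuming the variables of one set are stored as columns of a NumPy array; the function name `collinearity_report` and the threshold `tol` are illustrative choices, not part of any standard procedure.

```python
import numpy as np

def collinearity_report(data, tol=0.999):
    """Flag variable pairs with |correlation| >= tol and test for full column rank."""
    corr = np.corrcoef(data, rowvar=False)
    p = corr.shape[0]
    # Pairs of columns whose correlation is (numerically) 1 or -1.
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)
             if abs(corr[i, j]) >= tol]
    # A rank below p also catches collinearity involving three or more variables.
    centered = data - data.mean(axis=0)
    full_rank = np.linalg.matrix_rank(centered) == p
    return pairs, full_rank

# Hypothetical data: the third column is an exact linear function of the first.
rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = rng.normal(size=50)
data = np.column_stack([a, b, 2 * a + 1])

pairs, full_rank = collinearity_report(data)
print("perfectly correlated pairs:", pairs)
print("full column rank:", full_rank)
```

If a pair is flagged or the rank test fails, one of the redundant variables would need to be dropped or combined before the analysis.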
Example for Canonical Correlation
To show the strength of association between five aptitude tests and three tests on math, reading and writing, two data sets are used: one containing the standard tests (Math, Reading and Writing) and the other containing the aptitude tests (Apt1, Apt2, Apt3, Apt4 and Apt5).
Table 1: The first table reports the canonical correlation coefficients and the eigenvalues of the canonical roots. The first canonical correlation coefficient is .65723, with an eigenvalue of 0.76042, and this root accounts for 74.26% of the explained variance. This is consistent with the hypothesis that the three test scores and the five aptitude test scores are positively related, although the significance tests in Table 2 determine whether the relationship is statistically reliable.
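The eigenvalue and the canonical correlation in Table 1 are two views of the same quantity. Assuming the usual relationship, eigenvalue = r² / (1 − r²), the reported figures can be cross-checked with a few lines of Python (the variable names are illustrative):

```python
# Values reported in Table 1 for the first canonical root.
r1 = 0.65723
reported_eigenvalue = 0.76042

# Shared variance between the first pair of canonical variates.
shared_variance = r1 ** 2

# Eigenvalue implied by the canonical correlation.
implied_eigenvalue = shared_variance / (1 - shared_variance)

print(f"squared canonical correlation: {shared_variance:.5f}")
print(f"implied eigenvalue:            {implied_eigenvalue:.5f}")
print(f"reported eigenvalue:           {reported_eigenvalue:.5f}")
```

The implied value agrees with the reported eigenvalue to about four decimal places; the small difference comes from rounding in the reported correlation.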
Table 2: The second table shows the significance test results for the three canonical roots, tested at the p < 0.05 significance level in three steps (Roots 1 To 3, 2 To 3, and 3 To 3). ‘Roots 1 To 3’ includes all the canonical roots, ‘Roots 2 To 3’ considers only the last two canonical roots, and ‘Roots 3 To 3’ includes just the last canonical root. In this example, none of the roots is significant at p < .05.
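The stepwise structure of these tests can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Table 2: it uses Wilks' lambda with Bartlett's chi-square approximation, whereas statistical packages may report an F approximation instead. Only the first correlation (.65723) comes from the example; the second and third correlations and the sample size `n` are hypothetical placeholders.

```python
import math

def dimension_reduction_tests(canon_corrs, n, p, q):
    """Wilks' lambda and Bartlett's chi-square statistic for roots k..s."""
    s = len(canon_corrs)
    results = []
    for k in range(s):
        # Wilks' lambda uses only the roots still under test.
        wilks = math.prod(1 - r ** 2 for r in canon_corrs[k:])
        # Bartlett's approximation to the chi-square statistic.
        chi2 = -(n - 1 - (p + q + 1) / 2) * math.log(wilks)
        # Degrees of freedom shrink as leading roots are removed.
        df = (p - k) * (q - k)
        results.append((f"Roots {k + 1} To {s}", wilks, chi2, df))
    return results

# p = 5 aptitude tests, q = 3 standard tests.
# 0.30, 0.10 and n = 100 are hypothetical values for illustration only.
tests = dimension_reduction_tests([0.65723, 0.30, 0.10], n=100, p=5, q=3)
for label, wilks, chi2, df in tests:
    print(f"{label}: Wilks' lambda = {wilks:.4f}, chi-square = {chi2:.2f}, df = {df}")
```

Each chi-square statistic would then be compared with the chi-square distribution on the listed degrees of freedom to obtain a p-value.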
Application of Canonical Correlation Analysis
CCA is applicable wherever multiple data sets are available. For example:
- A credit card company can apply CCA to find the association between bank account types (Current, Savings, or Fixed Deposits) and the credit cards taken.
- A healthcare research centre can apply CCA to test the association between predictors of a disease, based on the medical history of patients.
- Insurance companies use CCA to test the association between the types of insurance policies taken (such as life insurance and health insurance) and the characteristics of individuals (such as income, age, gender and medical background).
- Marketers use CCA to examine the association between customers’ demographic factors and their preferences for different products.
Software packages that support CCA with multiple independent variables include R, SAS, MATLAB, STATA and SPSS. However, SPSS does not include a separate command for CCA; instead, the analysis is carried out using Syntax.