How to apply linear discriminant analysis?

Linear discriminant analysis is a multivariate technique for modeling the differences between groups. In this model, a categorical dependent variable is predicted from continuous or binary independent variables. It allows researchers to separate two or more classes, objects or categories based on the characteristics of other variables. It is a classification technique like logistic regression. However, the main difference is that binary logistic regression handles a dichotomous dependent variable, whereas discriminant analysis also handles a dependent variable with more than two categories.

For example, discriminant analysis helps determine whether students will go to college, trade school or discontinue education.

To do so, it looks for patterns in the students’ attributes that separate these categories.

For example, the attributes could be the student’s score, family income and participation in co-curricular activities.

Here discriminant analysis treats these attributes as independent variables to predict the student’s classification. The dependent variable in this case has three categories, so binary logistic regression is not suitable.

How linear discriminant analysis works

Linear discriminant analysis creates an equation that minimizes the possibility of wrongly classifying cases into their respective groups or categories. The equation is linear and takes the following form:

D = a1*X1 + a2*X2 + ... + ai*Xi + b,

where:
D = discriminant function (score)

X = responses for the variables (attributes)

a = discriminant coefficients

b = constant, and

i = number of discriminant variables.
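As a minimal sketch of how the discriminant function turns attribute values into a score, the Python snippet below evaluates D for one case. The coefficients, constant and attribute values are made-up numbers for illustration only, and `discriminant_score` is a hypothetical helper name, not part of any statistics package.

```python
def discriminant_score(coefficients, values, constant):
    """Return D = a1*X1 + a2*X2 + ... + ai*Xi + b for one case."""
    if len(coefficients) != len(values):
        raise ValueError("need one coefficient per discriminating variable")
    return sum(a * x for a, x in zip(coefficients, values)) + constant


# Hypothetical coefficients (a), constant (b) and one case's responses (X).
a = [0.5, -0.2, 1.0]
b = -1.0
x = [2.0, 5.0, 1.5]

score = discriminant_score(a, x, b)
print(score)
```

In practice the coefficients are estimated from the data by the software, and the resulting scores are compared across groups to assign each case to a category.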

Similar to linear regression, discriminant analysis also minimizes errors: it estimates the coefficients so that the possibility of misclassifying cases is as small as possible. Therefore, choose the best set of variables (attributes) and an accurate weight for each variable to minimize the possibility of misclassification.

Assumptions of discriminant analysis

Discriminant analysis rests on some strong assumptions, which also mark its difference from logistic regression:

  1. There must be two or more groups or categories.
  2. There must be at least two respondents (observational units, like the students in the above case) in each group.
  3. The number of discriminating variables in the model must be less than the total number of respondents minus 2.
  4. Discriminating variables are measured at the interval or ratio scale level. Dummy variables also work well.
  5. No discriminating variable may be a linear combination of the other discriminating variables.
  6. The covariance matrices must be approximately equal for each group, except for cases using special formulas.
  7. Each group is drawn from a population with a normal distribution on the discriminating variables. Group sizes should not differ too much; otherwise, membership in the largest group tends to be over-predicted.
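The counting-based assumptions in this list can be checked before running the analysis. The sketch below is one possible way to do that in Python. The function name `check_design` and the `max_size_ratio` threshold are assumptions made for illustration, and the code reads assumption 2 as "at least two cases in each group", which is an interpretation. The normality and equal-covariance assumptions need the raw data and are not covered here.

```python
def check_design(group_sizes, n_variables, max_size_ratio=2.0):
    """Check the counting-based assumptions (1, 2, 3) and group balance."""
    if not group_sizes:
        raise ValueError("group_sizes must not be empty")
    total_cases = sum(group_sizes)
    return {
        "two_or_more_groups": len(group_sizes) >= 2,
        "two_or_more_cases_per_group": all(n >= 2 for n in group_sizes),
        "variables_below_cases_minus_2": n_variables < total_cases - 2,
        # Arbitrary illustrative threshold, not a rule from the article.
        "group_sizes_balanced": max(group_sizes) <= max_size_ratio * min(group_sizes),
    }


# Illustrative design: three groups of ten cases and three variables.
checks = check_design(group_sizes=[10, 10, 10], n_variables=3)
for name, passed in checks.items():
    print(name, passed)
```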

Example of linear discriminant analysis

This section explains the application of discriminant analysis using hypothetical data. The dataset categorizes credit card holders as ‘Diamond’, ‘Platinum’ and ‘Gold’ based on the frequency of credit card transactions, the minimum amount of transactions and credit card payment. The aim is to classify the card holders into these three categories.

 Case dataset for linear discriminant analysis


The first step is to test the assumptions of discriminant analysis which are:

  • Normality in data.
  • Variables should be exclusive and independent (no perfect correlation among variables).
  • Homogenous variance.

SPSS software was used for conducting the discriminant analysis. Results are as follows:

Eigenvalues

| Function | Eigenvalue | % of Variance | Cumulative % | Canonical Correlation |
|---|---|---|---|---|
| 1 | .091 (a) | 66.6 | 66.6 | .289 |
| 2 | .046 (a) | 33.4 | 100.0 | .209 |

a. First 2 canonical discriminant functions were used in the analysis.
Eigenvalues from the discriminant analysis in SPSS

The eigenvalues show the discriminating ability of each function. They are the eigenvalues of the matrix product of the inverse of the within-groups sum of squares and cross-products matrix and the between-groups sum of squares and cross-products matrix. Similarly, the canonical correlation is the correlation between the discriminant function scores and membership in the groups of the dependent variable.
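The following is a rough NumPy sketch of where such eigenvalues come from. It builds the within-groups (W) and between-groups (B) matrices for randomly generated data and takes the eigenvalues of inv(W) @ B. The data are synthetic and hypothetical, not the credit card dataset, so the printed eigenvalues will not match the SPSS table. The last lines apply the relation canonical correlation = sqrt(eigenvalue / (1 + eigenvalue)) to the two eigenvalues reported in the table.

```python
import numpy as np

# Hypothetical data: 3 groups of 10 cases, 3 discriminating variables.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=shift, scale=1.0, size=(10, 3)) for shift in (0.0, 0.3, 0.6)]
all_cases = np.vstack(groups)
grand_mean = all_cases.mean(axis=0)

# Within-groups (W) and between-groups (B) sums of squares and cross-products.
W = np.zeros((3, 3))
B = np.zeros((3, 3))
for g in groups:
    centered = g - g.mean(axis=0)
    W += centered.T @ centered
    diff = (g.mean(axis=0) - grand_mean).reshape(-1, 1)
    B += len(g) * (diff @ diff.T)

# Eigenvalues of inv(W) @ B; at most min(variables, groups - 1) = 2 are non-zero.
eigenvalues = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1][:2]
canonical_corr = np.sqrt(eigenvalues / (1.0 + eigenvalues))
print(eigenvalues, canonical_corr)

# The same relation applied to the eigenvalues reported in the table.
reported_eigenvalues = np.array([0.091, 0.046])
reported_corr = np.sqrt(reported_eigenvalues / (1.0 + reported_eigenvalues))
print(reported_corr)
```

The values computed from the reported eigenvalues come out close to the canonical correlations of .289 and .209 in the table; small differences arise because the eigenvalues are rounded to three decimals.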

Wilks’ Lambda

| Test of Function(s) | Wilks’ Lambda | Chi-square | df | Sig. |
|---|---|---|---|---|
| 1 through 2 | .876 | 3.435 | 6 | .753 |
| 2 | .956 | 1.165 | 2 | .558 |

Wilks’ lambda values from discriminant analysis in SPSS

Similarly, Wilks’ lambda is another statistical output from the discriminant analysis. In this case, Wilks’ lambda for functions 1 through 2 is calculated from the two canonical correlations by using the following equation.

Wilks’ lambda = [1 - (0.289)^2] * [1 - (0.209)^2] ≈ 0.876
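The calculation can be reproduced with a few lines of Python. The sketch below recomputes both Wilks’ lambda values from the canonical correlations in the Eigenvalues table. It also applies Bartlett's chi-square approximation, assuming 30 cases, 3 discriminating variables and 3 groups. That design is inferred from the example rather than stated with the SPSS output, so treat the chi-square part as an assumption.

```python
import math

canonical_correlations = [0.289, 0.209]  # from the Eigenvalues table


def wilks_lambda(correlations):
    """Product of (1 - r^2) over the given canonical correlations."""
    return math.prod(1.0 - r ** 2 for r in correlations)


lambda_1_through_2 = wilks_lambda(canonical_correlations)
lambda_2 = wilks_lambda(canonical_correlations[1:])

# Assumed design, inferred from the example: 30 cases, 3 variables, 3 groups.
n_cases, n_variables, n_groups = 30, 3, 3


def bartlett_chi_square(lam):
    """Bartlett's approximation: -(n - 1 - (p + g) / 2) * ln(lambda)."""
    return -(n_cases - 1 - (n_variables + n_groups) / 2) * math.log(lam)


print(round(lambda_1_through_2, 3), round(lambda_2, 3))
print(bartlett_chi_square(lambda_1_through_2), bartlett_chi_square(lambda_2))
```

Because the canonical correlations are rounded to three decimals, the chi-square values agree with the table only approximately.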

The tables below explain the results. The first table shows the classification results. In the original classification, ‘Diamond’ and ‘Gold’ each show 60% prediction accuracy and ‘Platinum’ shows 30%. After cross-validation, ‘Diamond’ shows 50% prediction accuracy by the test attributes (variables), while ‘Platinum’ and ‘Gold’ show 30% and 20% accuracy respectively.

Classification Results

| Classification | Measure | Actual group | Predicted Diamond | Predicted Platinum | Predicted Gold | Total |
|---|---|---|---|---|---|---|
| Original (a) | Count | Diamond | 6 | 2 | 2 | 10 |
| | | Platinum | 4 | 3 | 3 | 10 |
| | | Gold | 2 | 2 | 6 | 10 |
| | % | Diamond | 60.0 | 20.0 | 20.0 | 100.0 |
| | | Platinum | 40.0 | 30.0 | 30.0 | 100.0 |
| | | Gold | 20.0 | 20.0 | 60.0 | 100.0 |
| Cross-validated (b, c) | Count | Diamond | 5 | 3 | 2 | 10 |
| | | Platinum | 4 | 3 | 3 | 10 |
| | | Gold | 4 | 4 | 2 | 10 |
| | % | Diamond | 50.0 | 30.0 | 20.0 | 100.0 |
| | | Platinum | 40.0 | 30.0 | 30.0 | 100.0 |
| | | Gold | 40.0 | 40.0 | 20.0 | 100.0 |

a. 50.0% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 33.3% of cross-validated grouped cases correctly classified.
Prediction from the discriminant analysis in SPSS.
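The accuracy figures in the classification table can be recomputed from the counts. The short Python sketch below does this for the original and the cross-validated classification; the function names are illustrative, and the counts are copied from the table above.

```python
groups = ["Diamond", "Platinum", "Gold"]

# Rows are actual groups, columns are predicted groups (counts from the table).
original = [[6, 2, 2], [4, 3, 3], [2, 2, 6]]
cross_validated = [[5, 3, 2], [4, 3, 3], [4, 4, 2]]


def overall_accuracy(matrix):
    """Share of cases on the diagonal, i.e. correctly classified cases."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_group_accuracy(matrix):
    """Correctly classified share within each actual group."""
    return {g: matrix[i][i] / sum(matrix[i]) for i, g in enumerate(groups)}


print(overall_accuracy(original), per_group_accuracy(original))
print(overall_accuracy(cross_validated), per_group_accuracy(cross_validated))
```

With three equally sized groups, assigning cases at random would classify about one third of them correctly, which is a useful baseline when reading the cross-validated result.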

Furthermore, the table below represents the predicted results of the discriminant analysis of the above case.

Prediction from the discriminant analysis in SPSS


Application of discriminant analysis

The application of discriminant analysis is similar to that of logistic regression. However, it additionally requires that the assumptions listed above are fulfilled, and it suits dependent variables with more than two categories. Also, discriminant analysis is applicable to small sample sizes, unlike logistic regression. A few instances where discriminant analysis is applicable are the evaluation of product or service quality and the design of promotional strategies by banks.

Lastly, software packages that support linear discriminant analysis include R, SAS, MATLAB, STATA and SPSS.

Priya Chetty

Partner at Project Guru
Priya Chetty writes frequently about advertising, media, marketing and finance. In addition to posting daily to Project Guru Knowledge Tank, she is currently in the editorial board of Research & Analysis wing of Project Guru. She emphasizes more on refined content for Project Guru's various paid services. She has also reviewed about various insights of the social insider by writing articles about what social media means for the media and marketing industries. She has also worked in outdoor media agencies like MPG and hotel marketing companies like CarePlus.
