Linear discriminant analysis (LDA) is a multivariate technique for modeling the differences between groups. It predicts a categorical dependent variable from continuous or binary independent variables. It allows researchers to separate two or more classes, objects or categories based on the characteristics of other variables. It is a classification technique, like logistic regression. The main difference is that binary logistic regression handles a dichotomous dependent variable, whereas discriminant analysis also handles dependent variables with more than two categories.
For example, discriminant analysis helps determine whether students will go to college, trade school or discontinue education.
It does so by looking for patterns in the students’ attributes that can decide their categorization.
Here the attributes could be the student’s score, family income and participation in co-curricular activities.
Discriminant analysis treats these attributes as independent variables to predict the student’s classification. The dependent variable in this case has three categories, so binary logistic regression is not suitable.
How linear discriminant analysis works
Linear discriminant analysis creates an equation which minimizes the possibility of wrongly classifying cases into their respective groups or categories. It includes a linear equation of the following form:
D = a1*X1 + a2*X2 + ... + ai*Xi + b

where:
- D = discriminant function (the score for a case),
- X = responses for the variables (attributes),
- a = discriminant coefficients,
- b = constant, and
- i = number of discriminating variables.
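To make the equation concrete, the short Python sketch below evaluates D for a single case. The coefficients, the constant and the student's values are hypothetical numbers chosen only for illustration; they are not estimated from any dataset.

```python
def discriminant_score(coefficients, values, constant):
    """Return D = a1*X1 + a2*X2 + ... + ai*Xi + b for one case."""
    if len(coefficients) != len(values):
        raise ValueError("need one coefficient per discriminating variable")
    return sum(a * x for a, x in zip(coefficients, values)) + constant


# Hypothetical coefficients for exam score, family income (in thousands)
# and number of co-curricular activities. Not estimated from real data.
a = [0.05, 0.02, 0.30]
b = -4.0

student = [70, 40, 2]  # score, income, activities (hypothetical case)
d = discriminant_score(a, student, b)
print(round(d, 2))
```

The resulting score is then compared with cut-off values, or with the scores of the group centroids, to assign the case to a group.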
Similar to linear regression, discriminant analysis also minimizes errors, in this case the possibility of misclassifying cases. The aim is therefore to choose the best set of variables (attributes) and an accurate weight for each variable so that misclassification is as low as possible.
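As an illustration of how such weights can be obtained, the sketch below uses Fisher's classic rule for the simplest case of two groups and two discriminating variables: the coefficients are the inverse of the pooled within-group covariance matrix multiplied by the difference between the group means. This is a minimal sketch under those assumptions, not the procedure of any particular statistics package, and the two small groups of (score, activities) values are hypothetical.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) cases."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, mean):
    """2x2 sums of squares and cross-products around the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_lda(group_a, group_b):
    """Return (coefficients, constant) for two groups, two variables."""
    m_a, m_b = mean_vector(group_a), mean_vector(group_b)
    s_a, s_b = scatter_matrix(group_a, m_a), scatter_matrix(group_b, m_b)
    dof = len(group_a) + len(group_b) - 2
    # Pooled within-group covariance matrix.
    w = [[(s_a[i][j] + s_b[i][j]) / dof for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-group covariance matrix is singular")
    inv = [[w[1][1] / det, -w[0][1] / det],
           [-w[1][0] / det, w[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    coef = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    # Constant chosen so that D = 0 halfway between the two group means.
    mid = [(m_a[0] + m_b[0]) / 2, (m_a[1] + m_b[1]) / 2]
    const = -(coef[0] * mid[0] + coef[1] * mid[1])
    return coef, const


def classify(case, coef, const):
    """Assign a case to group 'A' if its score is positive, else 'B'."""
    score = coef[0] * case[0] + coef[1] * case[1] + const
    return "A" if score > 0 else "B"


# Hypothetical (score, activities) data for two groups of students.
group_a = [(80, 6), (85, 4), (78, 7), (90, 5)]
group_b = [(55, 2), (60, 3), (52, 1), (65, 2)]

coef, const = fisher_lda(group_a, group_b)
print(classify((82, 5), coef, const))
```

With more than two groups or more variables, several discriminant functions are extracted instead of one, which is what packages such as SPSS report.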
Assumptions of discriminant analysis
Discriminant analysis rests on some strong assumptions, which also distinguish it from logistic regression:
- There must be two or more groups or categories.
- There must be at least two respondents (observational units, like the students in the case above) in each group.
- The number of discriminating variables in the model must be less than the total number of respondents minus 2.
- Discriminating variables are measured at the interval or ratio scale level. Dummy variables also work well.
- No discriminating variable may be a linear combination of the other discriminating variables.
- The covariance matrices must be approximately equal for each group, except for cases using special formulas.
- Each group is drawn from a population with a normal distribution on the discriminating variables. Group sizes should not be too different; otherwise cases will tend to be over-predicted as members of the largest group.
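The countable assumptions in the list above can be checked before running the analysis. The sketch below is a hypothetical helper that does so: the function name and the `max_size_ratio` threshold are assumptions made for illustration, since the text only says that group sizes should not be "too different". Normality and equality of covariance matrices need separate statistical tests and are not covered here.

```python
from collections import Counter


def check_design(groups, n_variables, max_size_ratio=3.0):
    """Return a list of problems found in the study design.

    groups: one group label per case.
    n_variables: number of discriminating variables.
    max_size_ratio: hypothetical limit on largest/smallest group size.
    """
    problems = []
    sizes = Counter(groups)
    n_cases = len(groups)
    if len(sizes) < 2:
        problems.append("fewer than two groups")
    if any(size < 2 for size in sizes.values()):
        problems.append("a group has fewer than two cases")
    if n_variables >= n_cases - 2:
        problems.append("too many discriminating variables for the sample size")
    if sizes and max(sizes.values()) / min(sizes.values()) > max_size_ratio:
        problems.append("group sizes are strongly unbalanced")
    return problems


# Hypothetical sample: 25 students in three groups, three variables.
labels = ["college"] * 10 + ["trade school"] * 8 + ["discontinue"] * 7
print(check_design(labels, n_variables=3))
```

An empty list means none of these simple checks failed.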
Example of linear discriminant analysis
This section explains the application of discriminant analysis using hypothetical data. The dataset categorizes credit card holders as ‘Diamond’, ‘Platinum’ or ‘Gold’ based on the frequency of credit card transactions, the minimum amount of transactions and credit card payment. The aim is to use these three variables to classify the card holders into the three categories.
The first step is to test the assumptions of discriminant analysis which are:
- Normality in data.
- Variables should be exclusive and independent (no perfect correlation among variables).
- Homogeneous variance across groups.
SPSS software was used for conducting the discriminant analysis. Results are as follows:
|Function||Eigenvalue||% of Variance||Cumulative %||Canonical Correlation|
|a. First 2 canonical discriminant functions were used in the analysis.|
|Eigenvalues from the discriminant analysis in SPSS|
Eigenvalues show the discriminating ability of each function. They are the eigenvalues of the matrix product of the inverse of the “within groups sum of squares” matrix and the “between groups sum of squares” matrix. Similarly, the canonical correlation values are the correlations between the discriminant function scores, which are built from the predictor variables, and the grouping of the dependent variable.
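The columns of the eigenvalue table are linked by simple formulas: the canonical correlation of a function is sqrt(eigenvalue / (1 + eigenvalue)), and the percentage of variance is each eigenvalue's share of their sum. The sketch below shows these relationships in Python. The function names are illustrative, and the two input values are the canonical correlations that this example uses in its Wilks’ lambda calculation (0.289 and 0.209); the eigenvalues derived from them are back-calculated, not copied from the SPSS output.

```python
import math


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant eigenvalue."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))


def eigenvalue_from_correlation(r):
    """Eigenvalue implied by a canonical correlation r (inverse of above)."""
    return r * r / (1.0 - r * r)


def percent_of_variance(eigenvalues):
    """Share of the total discriminating ability held by each function."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]


# Canonical correlations quoted in the example; eigenvalues are derived.
correlations = [0.289, 0.209]
eigenvalues = [eigenvalue_from_correlation(r) for r in correlations]
shares = percent_of_variance(eigenvalues)

for ev, share in zip(eigenvalues, shares):
    print(round(ev, 3), round(share, 1))
```

A larger eigenvalue means the function separates the groups better, which is why the first function always carries the largest share.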
|Test of Function(s)||
|1 through 2||
Wilks’ lambda values from discriminant analysis in SPSS
Wilks’ lambda is another statistic reported by the discriminant analysis. For the test of functions 1 through 2 it is the product, over both functions, of one minus the squared canonical correlation:
Wilks’ lambda = [1 - (0.289)^2] * [1 - (0.209)^2] ≈ 0.876
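The arithmetic in this equation can be reproduced with a few lines of Python. This is only a sketch of the calculation shown above, using the two canonical correlations from the equation; the function name is an assumption made for illustration.

```python
def wilks_lambda(canonical_correlations):
    """Product of (1 - r^2) over the given canonical correlations."""
    result = 1.0
    for r in canonical_correlations:
        result *= 1.0 - r ** 2
    return result


value = wilks_lambda([0.289, 0.209])
print(round(value, 3))  # 0.876
```

Values close to 1 indicate that the functions discriminate poorly between the groups, while values close to 0 indicate strong discrimination.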
The tables below explain the results. The first table shows the classification results. Here the ‘Diamond’ category is predicted with 50% accuracy by the test attributes (variables), while ‘Platinum’ and ‘Gold’ are predicted with 30% and 20% accuracy respectively.
|Classification||Predicted Group Membership||Total|
|a. 50.0% of original grouped cases correctly classified.|
|b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.|
|c. 33.3% of cross-validated grouped cases correctly classified.|
|Prediction from the discriminant analysis in SPSS.|
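The percentages reported under a classification table are hit rates: the number of correctly classified cases divided by the number of cases. The sketch below shows how they are computed from a table of actual versus predicted group membership. The counts are hypothetical and are not the values from the SPSS output of this example.

```python
def hit_rates(table, labels):
    """Return (overall %, per-group %) of correctly classified cases.

    table[i][j] is the number of cases from actual group i that were
    predicted to belong to group j.
    """
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    per_group = {
        labels[i]: 100.0 * table[i][i] / sum(table[i])
        for i in range(len(table))
    }
    return 100.0 * correct / total, per_group


labels = ["Diamond", "Platinum", "Gold"]
# Hypothetical counts: rows are actual groups, columns are predicted groups.
table = [
    [7, 2, 1],
    [3, 4, 3],
    [1, 3, 6],
]

overall, per_group = hit_rates(table, labels)
print(round(overall, 1), per_group)
```

The cross-validated figure is computed in the same way, except that each case is classified by functions estimated without that case.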
Furthermore, the table below represents the predicted results of the discriminant analysis of the above case.
Application of discriminant analysis
The application of discriminant analysis is similar to that of logistic regression. However, it additionally requires that the assumptions above are fulfilled, and it suits dependent variables with more than two categories. Discriminant analysis is also applicable with a small sample size, unlike logistic regression. Instances where it is applicable include the evaluation of product or service quality. Banks also use it for promotional strategies.
Lastly, software packages that support linear discriminant analysis include R, SAS, MATLAB, STATA and SPSS.