Machine learning involves solutions to predict scenarios based on past data. Logistic regression offers probability functions based on inputs and their corresponding output. In short, the dependent variable is a classification variable.
One of the most common methods of data analysis is the linear or multiple regression analysis. It helps establish the relationship between independent and dependent variables. In linear regression the dependent variable is continuous in most cases. However, if the dependent variable is categorical then simple regression will not allow for prediction. So, statisticians introduced the logistic regression method where the dependent variable is non-continuous or it is categorical. Like all regression, it is also used in predictive analysis. However, it helps describe the behavior of a variable which is binary, i.e., which can take only two values. If there are more than two categories in the dependent variable, then multinomial logistic regression is applicable instead of simple logistic regression. Below is an example of how this test works.
Application of logistic regression
This section contains a case study to explain the application of logistic regression on a dataset. The case dataset here contains two series, series X and Y. Series X indicates the CGPA (Cumulative Grade Points Average) of ten students of a school. Series Y indicates whether these students gets admission in college, where 1= admit, and 0=no admit. The table below represents their values.
Table 1: Values for logistic regression case 1
Here, logistic regression will help assess what level of CGPA leads to admission in college. This is called a ‘binary classification’ (either 1 or 0) problem. These types of cases need logistic regression. The aim here is to come up with the probability function that takes input X (CGPA) with outcome ‘what is the probability of getting admission’. Hence this test estimates a probability of each student based on its score. Furthermore, based on that probability, the analysis predicts if the student will get admission or not.
Logistic regression analysis is used to get the outcome for following expressions:
This function is also called the ‘log of odds’ of an event. On the basis of the above log of odds function the multiple regression equation for logistic regression can be:
Logistic regression case
Undertaken again a case study of students appearing for job interview, the aim is to determine the applicability of scores of interviews, CGPA, written and aptitude tests in hiring.
The above data was processed using SPSS. The images below show the results of the test on the above data. The first table displays the model summary and the significance level. They indicate if the undertaken variables were enough to ascertain the job status.
Here, Cox & Snell R squared and Nagelkerke R squared are similar to the R squared in the normal linear regression analysis. This explains the variation in the dependent variable due to independent variables included in the logistic model.
The second table list outs the SPSS generated ‘predicted’ job status, based on the trends in undertaken variables.
Applications of logistic regression
When classification is needed on the dependent variable, this test is applicable. Below are some examples showing its application.
- A credit card company can apply logistic regression to categorize the people in two types; good credit and bad credit based on the characteristics such as annual salary, monthly credit card payments and number of defaults.
- A hospital can apply this test to classify patients into critical and non-critical categories based on selected health attributes.
- Insurance companies use logistic regression to predict the chances that the policy holder will die before the term of policy expires based on characteristics like gender, age and physical examination.
- A bank may use this test to predict the chances that a loan applicant will default on loan or not, based on past debts, past defaults and annual income.
- Marketers may use it in examining whether a customer will respond to a particular ad based on preference of customer and type of web portal customer is active.
Software that support this test with multiple independent variables are R, SAS, MATLAB, STATA and SPSS.