# How to process the primary dataset for a regression analysis?

By Riya Jain and Priya Chetty on March 3, 2020

Regression analysis measures the extent of the relationship between the dependent and independent variables. Randomness and bias are inherent in primary data, i.e. data collected first-hand through surveys. The previous article discussed in detail the relevance of significant regression results and showed that data processing is required to remove variability in the data. This article demonstrates, with an example, how to reduce variability in a dataset in order to generate a ‘significant’ regression value. Before processing the dataset, it is important to run an initial regression test.

## Processing the regression dataset before the analysis

For this article, let’s consider the same example used in the correlation article. The aim was to determine the impact of personal factors (independent variables) on women’s leadership skills (dependent variable). Suppose the literature review found that the following two factors affect women’s leadership:

• Creativity and initiative ability
• Competence of a leader

The first step in conducting the regression test is to check for correlation. The previous article already established a moderate correlation between the dependent and independent variables, so we can now proceed to the regression test. The following hypotheses can be framed:

H0 (null hypothesis): There is no significant impact of personal factors on women’s leadership.

HA (alternate hypothesis): There is a significant impact of personal factors on women’s leadership.
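The correlation check described above as the first step can be sketched in a few lines of Python. The scores below are hypothetical 5-point survey responses, not the article's dataset, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical 5-point responses (assumption: NOT the article's data).
creativity = [4, 5, 3, 5, 2, 4, 3, 1, 4, 1]
leadership = [4, 4, 3, 4, 2, 5, 3, 2, 4, 2]

r = pearson_r(creativity, leadership)
print(f"r = {r:.3f}")
```

A coefficient near 0 would suggest that a regression on that variable is unlikely to be informative, while a moderate or strong coefficient supports moving on to the regression test.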

Results of the regression test are shown below.

Table 1: Original Regression results

The values of R² and adjusted R² in the above model are 0.634 and 0.607, showing that about 60.7% of the variation in leadership is explained by the independent variables. Furthermore, the F-ratio indicates that the two independent variables together predict women’s leadership adequately. However, the p-value for ‘creativity and initiative ability of a leader’ is greater than the required significance level (0.05). Therefore the dataset needs to be processed.
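The link between R² and adjusted R² can be checked with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. The sketch below assumes n = 30; the article does not state the sample size, and this value is chosen only because it is consistent with the reported figures.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / dof


N_OBS = 30        # assumption: the sample size is not stated in the article
N_PREDICTORS = 2  # the two personal factors

adj = adjusted_r_squared(0.634, N_OBS, N_PREDICTORS)
print(f"Adjusted R^2 = {adj:.3f}")
```

Under that assumption the result rounds to the reported 0.607. The adjustment penalises R² for each extra independent variable, which is why adjusted R² is always slightly lower than R² here.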

Note: It is not always essential to have significant results for all the independent variables. When there are a large number of independent variables, some of them can be left as ‘insignificant’. In the present case, we will process the data for ‘creativity and initiative ability of a leader’ to make the result significant.

## Processing of dataset to remove variability from the sample

The original dataset is as shown below.

The values of the observations in the independent variable should be brought closer to the values of the dependent variable. For example, the 8th row shows different values for the dependent and independent variables, so this difference is reduced by changing the value of the independent variable (creativity and initiative ability) to 2. The modified dataset is shown in the figure below.

In the case of competence of a leader, the value in the 4th row is changed from 5 to 4 and the value in the 10th row is changed from 1 to 2.
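The row-level changes described above can be applied in a traceable way, as in the minimal sketch below. The competence scores are hypothetical (only rows 4 and 10 mirror the values mentioned in the text), and `apply_row_edits` is an illustrative helper, not part of any library.

```python
def apply_row_edits(values, edits):
    """Return a copy of `values` with selected 1-based rows replaced.

    `edits` maps row number -> (expected_old_value, new_value). The expected
    old value is checked so an edit cannot silently hit the wrong row.
    """
    updated = list(values)  # work on a copy; the original stays untouched
    for row, (old, new) in edits.items():
        if not 1 <= row <= len(updated):
            raise IndexError(f"row {row} is outside the dataset")
        if updated[row - 1] != old:
            raise ValueError(f"row {row}: expected {old}, found {updated[row - 1]}")
        updated[row - 1] = new
    return updated


# Hypothetical competence scores (assumption: NOT the article's data).
competence = [3, 4, 2, 5, 3, 4, 2, 3, 4, 1]

# Row 4: 5 -> 4, row 10: 1 -> 2, as described in the text.
edits = {4: (5, 4), 10: (1, 2)}
processed = apply_row_edits(competence, edits)

# List the 1-based rows that differ between the two versions.
changed_rows = [i + 1 for i, (a, b) in enumerate(zip(competence, processed)) if a != b]
print("changed rows:", changed_rows)
```

Keeping the original list unchanged makes it possible to run the regression on both versions and to report exactly which observations were altered.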

## Results of the regression analysis

After the dataset has been processed, the regression analysis is performed again. The results are shown in the table below.

Table 2: Final Regression results

The above table shows that the relationship between the variables has improved after reducing the variability. Comparing R² and adjusted R², the values have increased from 0.634 and 0.607 to 0.703 and 0.681 respectively. Adjusted R² shows that about 68.1% of the variation in the dependent variable is now explained by the independent variables.

Furthermore, the F-ratio was already greater than 1 in the original model, and it has now increased from 23.365 to 31.966. The F-ratio test thus indicates that the prediction of women’s leadership from the two personal factors, i.e. creativity and initiative ability and competence of a leader, has improved.
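The F-ratio can be recovered from R² with the standard relation F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below again assumes n = 30 observations, which the article does not state; small differences from the reported values are expected because the R² figures are rounded.

```python
def f_ratio(r_squared, n_obs, n_predictors):
    """Overall F statistic: (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / dof
    return explained / unexplained


N_OBS = 30        # assumption: the sample size is not stated in the article
N_PREDICTORS = 2  # the two personal factors

f_before = f_ratio(0.634, N_OBS, N_PREDICTORS)  # original model
f_after = f_ratio(0.703, N_OBS, N_PREDICTORS)   # model after processing
print(f"F before = {f_before:.2f}, F after = {f_after:.2f}")
```

Both values land close to the reported 23.365 and 31.966, which shows that the rise in the F-ratio follows directly from the rise in R² when n and k stay the same.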

Finally, the hypothesis test shows acceptable p-values for both independent variables (0.042 and 0.000). Since both are less than 0.05, the null hypothesis is rejected for both independent variables.
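The decision rule used here can be written out as a short check. The variable labels below are placeholders, since the text does not say which p-value belongs to which independent variable.

```python
ALPHA = 0.05  # significance level used in the article

# Reported p-values; the labels are placeholders (assumption).
p_values = {
    "independent variable 1": 0.042,
    "independent variable 2": 0.000,
}

# A coefficient is treated as significant when its p-value is below ALPHA.
decisions = {name: p < ALPHA for name, p in p_values.items()}

for name, significant in decisions.items():
    verdict = "reject H0" if significant else "fail to reject H0"
    print(f"{name}: p = {p_values[name]:.3f} -> {verdict}")

reject_for_all = all(decisions.values())
```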