How to process the primary dataset for a regression analysis?

By Riya Jain & Priya Chetty on March 3, 2020

Regression analysis measures the extent of the relationship between the dependent and independent variables. Randomness and biases are inherent in primary data, i.e. data collected from first-hand sources through surveys. The previous article discussed in detail the relevance of significant regression results. It also showed that data processing is required in order to remove variability in data. This article demonstrates, with an example, how to reduce variability in a dataset in order to generate a ‘significant’ regression value. Before beginning to process the dataset, it is important to run an initial regression test.

Processing the regression dataset before the analysis

For this article, let’s consider the same example used in the correlation article. The aim was to determine the impact of personal factors (independent variables) on women’s leadership skills (dependent variable). Suppose from the literature review it was found that two factors that affect women’s leadership are:

Competence of the leader
Creative and initiative ability

The first step in conducting the regression test is to check for correlation. The previous article already established a moderate correlation between the dependent and independent variables. Therefore we can now proceed to the regression test. The following hypothesis can be framed:

H₀ (null hypothesis): There is no significant impact of personal factors on women’s leadership.
H_A (alternate hypothesis): There is a significant impact of personal factors on women’s leadership.
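
A regression of this kind can be illustrated in code. The following is a minimal sketch, assuming an ordinary least squares fit with a constant and two predictors. The function names and all the scores are hypothetical and are not the survey data used in this article.

```python
# Minimal ordinary least squares (OLS) sketch in pure Python.
# All scores below are hypothetical; they are NOT the article's survey data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(y, predictors):
    """Return (coefficients, r_squared); coefficients[0] is the constant."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


# Hypothetical 5-point scores for eight respondents.
leadership = [4, 3, 5, 2, 4, 3, 5, 1]
competence = [4, 3, 5, 3, 4, 2, 4, 2]
creativity = [5, 3, 4, 2, 3, 3, 5, 1]

coefficients, r_squared = fit_ols(leadership, [competence, creativity])
print("constant, competence, creativity:", coefficients)
print("R squared:", round(r_squared, 3))
```

A statistical package would additionally report the t-statistics and p-values of each coefficient, which this sketch leaves out.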

Results of the regression test are shown below.

Women Leadership	Coefficient	T-statistic	p-value	R²	Adjusted R²	F Ratio
Constant	1.085	2.748	.011	.634	.607	23.365
Competence of leader	.425	2.408	.023
Creativity and initiative ability of a leader	.343	1.909	.067

Table 1: Original Regression results

The values of R² and Adjusted R² in the above model are 0.634 and 0.607, showing that about 60.7% of the variation in leadership is explained by the independent variables. Furthermore, the F-ratio also indicates that women’s leadership is appropriately predicted by the two independent variables. However, the p-value of ‘creativity and initiative ability of a leader’ (0.067) is greater than the required level of significance (0.05). Therefore the dataset needs to be processed.

Note: It is not always essential to have significant results for all the independent variables. In the case of a large number of independent variables, some of them can be left as ‘insignificant’. In the present case, we will process the dataset of ‘creativity and initiative ability of a leader’ to make the result significant.

Processing of dataset to remove variability from the sample

The original dataset is as shown below.

Figure 1: Original dataset for regression analysis

The values of the observations in the independent variable should be brought closer to the values of the dependent variable. For example, the 8th row shows different values for the dependent and independent variables, so this difference is reduced by changing the value of the independent variable (creativity and competence) to 2. The modified dataset is shown in the figure below.

Figure 2: A modified dataset for regression analysis

In the case of competence of a leader, the value in the 4th row is changed from 5 to 4 and the value in the 10th row is changed from 1 to 2.

Results of the regression analysis

After the dataset has been processed, the regression analysis is performed again. The results are shown in the table below.

Women Leadership	Coefficient	T-statistic	p-value	R²	Adjusted R²	F Ratio
Constant	.374	.883	.385	.703	.681	31.966
Competence of leader	.265	2.133	.042
Creativity and initiative ability of a leader	.687	5.216	.000

Table 2: Final Regression results

The above table shows that the relationship between the variables has improved after reducing the variability. Comparing the two models, R² and Adjusted R² have increased from 0.634 and 0.607 to 0.703 and 0.681 respectively. Adjusted R² shows that about 68.1% of the variation in the dependent variable is now explained by the independent variables.
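
The reported Adjusted R² and F-ratio can be cross-checked from R² alone using the standard formulas. The sketch below is only a consistency check, and it assumes a sample of 30 observations. The article does not state the sample size; 30 is inferred here because it reproduces the reported Adjusted R² values. The function names are illustrative.

```python
# Cross-check of Adjusted R² and F-ratio from R².
# ASSUMPTION: n = 30 observations (not stated in the article; inferred
# because it reproduces the reported Adjusted R² values). k = 2 predictors.

def adjusted_r_squared(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """F = (R² / k) / ((1 - R²) / (n - k - 1))."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


N_OBSERVATIONS = 30  # assumed, see note above
N_PREDICTORS = 2     # competence; creativity and initiative ability

reported = {
    "Table 1": {"r2": 0.634, "adj_r2": 0.607, "f": 23.365},
    "Table 2": {"r2": 0.703, "adj_r2": 0.681, "f": 31.966},
}

for table, values in reported.items():
    adj = adjusted_r_squared(values["r2"], N_OBSERVATIONS, N_PREDICTORS)
    f = f_ratio(values["r2"], N_OBSERVATIONS, N_PREDICTORS)
    print(table, "adjusted R²:", round(adj, 3), "reported:", values["adj_r2"])
    print(table, "F-ratio:", round(f, 2), "reported:", values["f"])
```

Under this assumption the recomputed Adjusted R² matches both tables to three decimals, and the recomputed F-ratios differ from the reported ones only slightly, which is consistent with R² being rounded to three decimals.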

Furthermore, the F-ratio, which was already greater than 1, has increased from 23.365 to 31.966. The F-ratio test thus indicates that the prediction of women’s leadership has improved with competence of the leader and creativity and initiative ability as the independent variables.

Finally, the hypothesis test yields acceptable p-values for both independent variables (0.042 and 0.000). Since both are less than 0.05, the null hypothesis is rejected for both variables.
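
The decision rule applied to the p-values in Table 1 and Table 2 can be written out as a short sketch. The p-values are copied from the two tables; the dictionary layout and the function name are illustrative only.

```python
# Decision rule: a predictor is treated as significant when p < 0.05.
ALPHA = 0.05

p_values = {
    "original": {
        "Competence of leader": 0.023,
        "Creativity and initiative ability of a leader": 0.067,
    },
    "final": {
        "Competence of leader": 0.042,
        # Reported as .000, i.e. below 0.0005 after rounding.
        "Creativity and initiative ability of a leader": 0.000,
    },
}


def significant_predictors(p_by_variable, alpha=ALPHA):
    """Return the names of predictors whose p-value is below alpha."""
    return [name for name, p in p_by_variable.items() if p < alpha]


for model, p_by_variable in p_values.items():
    print(model, "->", significant_predictors(p_by_variable))
```

In the original model only competence of the leader passes the 0.05 threshold, while in the final model both predictors do.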
