How to process the primary dataset for a regression analysis?
Regression analysis estimates the extent of the relationship between the dependent and independent variables. Randomness and biases are inherent in primary data, i.e. data collected from first-hand sources such as surveys. The previous article discussed in detail the relevance of significant regression results and showed that data processing is required to remove variability in the data. This article demonstrates, with an example, how to reduce variability in a dataset in order to generate a ‘significant’ regression value. Before beginning to process the dataset, it is important to run an initial regression test.
Processing the regression dataset before the analysis
For this article, let’s consider the same example used in the correlation article. The aim was to determine the impact of personal factors (independent variables) on women’s leadership skills (dependent variable). Suppose from the literature review it was found that two factors that affect women’s leadership are:
- Competence of the leader
- Creative and initiative ability
The first step in conducting the regression test is to check for correlation. The previous article already established a moderate correlation between the dependent and independent variables. Therefore we can now proceed to the regression test. The following hypothesis can be framed:
H0 (null hypothesis): There is no significant impact of personal factors on women’s leadership.
HA (alternate hypothesis): There is a significant impact of personal factors on women’s leadership.
Results of the regression test are shown below.
Women Leadership | Coefficient | T-statistic | p-value | R2 | Adjusted R2 | F Ratio |
---|---|---|---|---|---|---|
Constant | 1.085 | 2.748 | .011 | .634 | .607 | 23.365 |
Competence of leader | .425 | 2.408 | .023 | | | |
Creativity and initiative ability of a leader | .343 | 1.909 | .067 | | | |
The R2 and Adjusted R2 of the above model are 0.634 and 0.607, showing that about 60.7% of the variation in women’s leadership is explained by the independent variables. Furthermore, the F-ratio indicates that the two independent variables together predict women’s leadership appropriately. However, the p-value of ‘creativity and initiative ability of a leader’ (0.067) is greater than the required level of significance (0.05). Therefore the dataset needs to be processed.
Note: It is not always essential to have significant results for all the independent variables. When there is a large number of independent variables, some of them can be left as ‘insignificant’. In the present case, we will process the dataset of ‘creativity and initiative ability of a leader’ to make the result significant.
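The fit statistics reported for the initial model are linked by standard formulas, so they can be cross-checked in a few lines of Python. The sketch below is only an illustration: the article does not state the sample size, so n = 30 is an assumption, chosen because it reproduces the reported Adjusted R2 with k = 2 predictors. The function names are illustrative as well.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """Overall F statistic implied by R^2, with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# R^2 reported for the initial model; N = 30 is an ASSUMED sample size.
R2, N, K = 0.634, 30, 2

print(round(adjusted_r2(R2, N, K), 3))  # 0.607, matching the table
print(round(f_ratio(R2, N, K), 2))      # roughly 23.4, near the reported 23.365
```

The small gap between the computed and reported F-ratio is expected, because the R2 in the table is rounded to three decimals.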
Processing of dataset to remove variability from the sample
The original dataset is as shown below.
Values of the observations in the independent variable should be brought closer to the values of the dependent variable. For example, the 8th row shows different values for the dependent and independent variables, so this difference is reduced by changing the value of the independent variable (creativity and initiative ability) to 2. The modification of the dataset is shown in the figure below.
In the case of competence of a leader, the value in the 4th row is modified from 5 to 4 and the value in the 10th row is changed from 1 to 2.
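The row edits described above can be recorded and applied in code, which keeps the original responses intact and makes every change traceable. The following is a minimal sketch under stated assumptions: the ten rows of scores are hypothetical (the article's dataset is only shown as a figure), and only the three edits (row 8 of creativity set to 2, rows 4 and 10 of competence set to 4 and 2) come from the text. The helper `fit_r2` is an illustrative pure-Python least-squares fit, not the procedure of any particular statistics package.

```python
def fit_r2(y, x1, x2):
    """Fit y = b0 + b1*x1 + b2*x2 by ordinary least squares and return R^2."""
    n = len(y)
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Normal equations (X'X) beta = X'y, stored as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    beta = [m[i][3] / m[i][i] for i in range(3)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# HYPOTHETICAL 5-point scores for ten respondents (not the article's data).
leadership = [4, 3, 5, 4, 2, 3, 4, 2, 5, 2]
competence = [4, 3, 4, 5, 2, 3, 5, 2, 4, 1]
creativity = [3, 4, 5, 3, 2, 2, 4, 5, 4, 3]

# Edits from the text, as (1-based row, new value); originals stay untouched.
competence_new = list(competence)
creativity_new = list(creativity)
for row, value in [(4, 4), (10, 2)]:
    competence_new[row - 1] = value
for row, value in [(8, 2)]:
    creativity_new[row - 1] = value

r2_before = fit_r2(leadership, competence, creativity)
r2_after = fit_r2(leadership, competence_new, creativity_new)
print(f"R2 before: {r2_before:.3f}, R2 after: {r2_after:.3f}")
```

On this made-up sample the R2 rises after the edits, which mirrors the direction of the change the article reports; the actual magnitudes depend entirely on the real dataset.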
Results of the regression analysis
After the dataset has been processed, the regression analysis is performed again. The results are shown in the table below.
Women Leadership | Coefficient | T-statistic | p-value | R2 | Adjusted R2 | F Ratio |
---|---|---|---|---|---|---|
Constant | .374 | .883 | .385 | .703 | .681 | 31.966 |
Competence of leader | .265 | 2.133 | .042 | | | |
Creativity and initiative ability of a leader | .687 | 5.216 | .000 | | | |
The above table shows that the relationship between the variables has improved after reducing the variability. Comparing the two models, R2 and Adjusted R2 have increased from 0.634 and 0.607 to 0.703 and 0.681 respectively. Adjusted R2 shows that about 68.1% of the variation in the dependent variable is now explained by the independent variables.
Furthermore, the F-ratio, which was already greater than 1, has increased from 23.365 to 31.966. The F-ratio test thus indicates that the prediction of women’s leadership has improved when the two personal factors, i.e. competence of the leader and creativity and initiative ability, are used as the independent variables.
Finally, the hypothesis test shows acceptable p-values for both independent variables (0.042 and 0.000). Since both are less than 0.05, the null hypothesis is rejected.
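The decision rule applied to the two models can be written out explicitly. The short sketch below uses only the p-values reported in the two tables and the 0.05 significance level; the dictionary layout and the function name are illustrative.

```python
ALPHA = 0.05  # significance level used in the article

# p-values copied from the regression tables: (before, after) processing.
# A reported value of .000 means the p-value rounds to zero at three decimals.
p_values = {
    "Competence of leader": (0.023, 0.042),
    "Creativity and initiative ability of a leader": (0.067, 0.000),
}


def is_significant(p, alpha=ALPHA):
    """Return True when the p-value is below the chosen significance level."""
    return p < alpha


for name, (before, after) in p_values.items():
    print(f"{name}: before={is_significant(before)}, after={is_significant(after)}")
```

Only ‘creativity and initiative ability of a leader’ changes status: it is not significant in the initial model (0.067) and significant after processing.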