How to improve the correlation between the variables?

By Riya Jain & Priya Chetty on February 27, 2020

Correlation refers to the extent of the relationship between variables, and a correlation test is how that relationship is established. For example, a researcher intends to find out how personal factors affect the success of women leaders. For this, the researcher needs to identify the determinants (variables) of personal factors, such as:

  • needs,
  • marital status,
  • age,
  • confidence,
  • skills,
  • innovativeness, etc.

These determinants then need to be tested, individually or collectively, against the leaders' success.

A previous article showed the process of conducting a correlation test. It also showed the condition required for proving a significant linkage between the dependent and independent variables. However, the dataset is not always perfect. When the dataset contains many outliers or inconsistencies, a correlation cannot be determined. This is when the dataset needs processing.

This article shows how to process a dataset so that these inconsistencies can be removed, and correlation is proven between variables.

What influences the significance (p) value?

The significance (p) value of Pearson’s correlation test indicates whether the linear relationship observed between the variables is statistically distinguishable from zero, while the coefficient itself indicates how strong that relationship is. The coefficient value depends on the differences between the paired observations of the variables. In the approach used here, the essential requirement for deriving a high correlation value is:

  • To maximize the difference between the paired observations of the variables or factors (dependent and independent variables) when a negative relationship is to be derived
  • To minimize the difference between the paired observations of the variables or factors (dependent and independent variables) when a positive relationship is to be obtained
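For reference, the standard definition of Pearson's coefficient is given below. The symbols are not used in the article itself and are introduced here only for illustration: x_i and y_i are the paired observations of the independent and dependent variable, and the barred symbols are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Under this definition, r moves towards 1 when the two variables tend to sit on the same side of their respective means in the same rows, and towards -1 when they tend to sit on opposite sides.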

Why use data processing?

Many times, the value of correlation derived from the primary data is ‘perfect’, i.e. 1 or -1. A Pearson correlation coefficient of 1 or -1 signifies a perfect positive or negative relationship between two variables. In primary research, data is based on human perception of issues. Due to the randomness and biases in human behaviour, a perfect correlation between two variables is practically impossible. Some of the reasons for this bias include:

  • Acquiescence or Friendliness bias: giving a response just to complete the survey,
  • Social desirability or Acceptability bias: answering sensitive or personal questions based on social desirability,
  • Habituation bias: giving the same answer to similarly worded questions, or
  • Sponsor bias: answers influenced by the reputation or mission of the researcher.

Hence, there is a need to reduce or increase the correlation value. This is done through data processing.

What is the procedure of data processing?

In order to process the dataset, it is initially recommended to check the Pearson correlation coefficient value in MS Excel before processing the dataset with any other software for analysis.

Step 1: Determining the correlation value

Apply the Pearson correlation coefficient formula and determine the value of linkage between the variables:

Formula: =PEARSON(array1, array2)

OR

Formula: =PEARSON(dependent variable array, independent variable array)

For example, continuing with the case above, the Pearson value for women's leadership (the dependent variable) and the independent variables is computed as shown below.

Figure 1: Pearson Coefficient Value computation in MS Excel
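Outside Excel, the same coefficient can be computed in a few lines of code. The sketch below is a minimal pure-Python version of what PEARSON returns for two equal-length ranges. The function name `pearson` and the sample responses are hypothetical and are not taken from the article's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical 5-point scale responses (illustrative only).
leadership = [5, 4, 5, 3, 4, 2]
competence = [5, 4, 5, 3, 4, 2]  # identical column, so the linkage is perfect

print(round(pearson(leadership, competence), 3))
```

The check for a constant column mirrors the fact that the coefficient is undefined when one variable does not vary at all.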

Step 2: Changing values of multiple independent variables

Note: In the case of many independent variables, fix the dependent variable array with an absolute reference ($) so that only the independent variable column changes, as shown below.

Figure 2: Pearson Coefficient value

Now, drag the fill handle at the bottom-right corner of the cell across the other independent variables. The Pearson coefficient value for each linkage is shown in the figure below.

Figure 3: Computed correlation coefficient values of all independent variables
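The fill-across step, with the dependent column held fixed and each independent column taken in turn, can be sketched as a loop. The column names and values below are hypothetical stand-ins, not the data in Figure 3, and `pearson` is an illustrative helper rather than a library function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical survey responses on a 5-point scale (illustrative only).
dependent = [5, 4, 5, 3, 4, 2, 1, 3]
independents = {
    "personality": [1, 4, 2, 3, 5, 2, 4, 3],
    "competence": [5, 4, 5, 3, 4, 2, 1, 3],  # same as the dependent column
    "confidence": [1, 2, 1, 3, 2, 4, 5, 3],  # mirror image of the dependent column
}

# The dependent column stays fixed, like the $-anchored range in Excel,
# while each independent column is paired with it in turn.
coefficients = {name: pearson(dependent, column) for name, column in independents.items()}

for name, r in coefficients.items():
    print(f"{name}: {r:.3f}")
```

Columns that come out at exactly 1 or -1, or close to 0, are the ones the article singles out for processing in Step 3.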

The above figure shows that the coefficient of the 1st variable (personality of a leader) is 0.029, that of the 2nd (individual needs of a leader) is approximately -0.066, and that of the 4th (confidence and courage of the leader) is -1. In contrast, the value for the 3rd variable (competence of the leader) is too high, i.e. 1. Data processing therefore needs to be done for these 4 variables, while the linkage of the 5th variable (creativity and initiative ability of a leader) is appropriate.

Step 3: Processing the data

Case 1 (a): When the coefficient value is less than the moderate positive relationship value, i.e. 0 ≤ r < 0.5

Figure 3 above shows a very large difference between the values of the dependent and independent variables, like 5 and 1, or 4 and 1. To reduce this difference, replace some of the observation values of the independent variable with values close to the dependent variable’s.

EXAMPLE

In the case of the 3rd row and B column, the dependent variable value is 5 while the independent variable value is 1 (figure 3), thus, 1 is replaced by 4 to reduce the difference (figure 4 below).

When these values are changed, the coefficient automatically changes. This process is repeated until at least a 0.5 value of the Pearson coefficient is derived.

Figure 4: Modified correlation coefficient results showing a positive relationship

This process is repeated for the other independent variables that have a positive correlation value below 0.5.
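How far a single replaced observation moves the coefficient can be checked directly. The sketch below recomputes the coefficient before and after one value of the independent variable is changed from 1 to 4 in a row where the dependent value is 5. The data is hypothetical and is not the dataset behind Figures 3 and 4.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical responses on a 5-point scale (illustrative only).
dependent = [5, 4, 5, 3, 4, 2]
independent = [1, 4, 2, 3, 5, 2]

before = pearson(dependent, independent)

# Replace the first observation (1, paired with a dependent value of 5) by 4.
edited = list(independent)
edited[0] = 4
after = pearson(dependent, edited)

print(f"before: {before:.3f}, after: {after:.3f}")
```

With so few rows, one replacement shifts the coefficient noticeably. In a larger dataset the same change has a much smaller effect, which is why the article repeats the step several times.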

Case 1 (b): When the coefficient value is weaker than the moderate negative relationship value, i.e. -0.5 < r < 0

On the other hand, when the correlation coefficient is negative, i.e. less than 0, the two variables are said to be negatively correlated. To strengthen this correlation, increase the difference between the variables. This is done by identifying an independent variable observation that is identical or close to the dependent observation value and replacing it with a value that increases the difference between the variables.

EXAMPLE

The correlation coefficient value of the second variable (individual needs of a leader) is approximately -0.06. The difference between the observations in the 6th row and the C column is small, i.e. the dependent variable value is 4 while the independent variable value is 5. To increase the difference, 5 is replaced by 1. This makes the correlation between the variables more strongly negative. The result is now -0.63.

Follow this process for all variables showing a negative coefficient weaker than -0.5.

Figure 5: Modified negative correlation coefficient result
NOTE

It is not essential to derive a moderate linkage between all the variables. In the case of very few factors, the correlation value of all the factors is processed while in the case of a large number of factors, only a few independent variables’ observation values are processed.

Case 2 (a): When the coefficient value is perfectly positive, i.e. r = 1

Figure 5 above shows that there is a perfect linkage between the variables, i.e. women’s leadership and competence of leaders: the values of the dependent and independent variables are the same, so the coefficient value is 1. This implies that the data is biased. To remove the bias, reduce the coefficient value by increasing the difference between the two variables’ observations.

EXAMPLE

The observation value in the 3rd row and D column is 5 for both the dependent and the independent variable. The value of the independent variable is replaced by 1. This reduces the value of the correlation between the variables.

This process is repeated until the correlation value is reduced to 0.8 or 0.9.

Figure 6: Modified correlation coefficient results for a perfectly positive correlation.

Case 2 (b): When the coefficient value is perfectly negative, i.e. r = -1

A perfectly negative correlation is weakened in the opposite way, by reducing the difference between the variables. In this case, the value of the independent variable is replaced by a value close to the dependent variable value.

EXAMPLE

Figure 6 above shows that, in the 3rd row and E column, the value of the dependent variable is 5 while the value of the independent variable is 1. Thus, to reduce the correlation between the variables, the value is replaced by 4. This brings the correlation coefficient value to -0.76.

Figure 7: When the coefficient value is perfectly negative, i.e. r = -1

The above-stated process is repeated until adequate linkage between the variables is derived.

Performing the correlation analysis in SPSS

The final step in determining the linkage between the dependent and independent variables is to statistically analyze the dataset. The procedure for Pearson correlation analysis in SPSS is followed, and a conclusion can then be drawn by interpreting the results of the analysis.

Correlation with Women Leadership (Dependent)

Variable                                          Pearson coefficient   Significance
Women Leadership (Dependent)                      1                     -
The personality of a leader                       .541**                .002
Individual needs of a leader                      -.636**               .000
Competence of the leader                          .764**                .000
Confidence and courage of the leader              -.766**               .000
Creativity and initiative ability of a leader     .745**                .000

Table 1: Correlation Analysis results

The above analysis shows that the significance value of every independent variable is less than the significance level of the study. Thus, there exists a significant linear relationship between women’s leadership and each of the independent variables, including the confidence and courage of the leader.
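For context, the significance value reported for a Pearson coefficient is conventionally derived from a t statistic with n - 2 degrees of freedom. In the expression below, n is the number of paired observations, which the article does not state, so n is a symbol introduced here for illustration only.

```latex
t = r \sqrt{\frac{n - 2}{1 - r^2}}, \qquad \text{degrees of freedom} = n - 2
```

The p value therefore depends on both the size of r and the sample size: for a fixed r, a larger n gives a larger |t| and a smaller p value.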
