# Primary data coding procedure for statistical analysis

By Riya Jain & Priya Chetty on July 12, 2021

Research can be based on primary data or secondary data. Though secondary data is easily available from published sources or official websites, primary data is based on respondents’ perceptions and thus has a chance of bias. It is essential to avoid biases and errors in data collection. The data needs to be random as it helps derive precise results in statistical analysis. The first step in the process of statistical analysis is primary data coding. This refers to the coding of responses obtained in the survey. Primary data coding has to be:

• Random
• Accurately coded
• Elimination or reduction of missing values

## Primary data coding in MS Excel

Typically, the primary data is coded in MS Excel. This is because Excel is versatile, simple and easy to be imported into other statistical applications for further analysis.

The following steps should be followed for preparing the datasheet.

### Step 1: Defining the number of questions and sub-options in the code

Firstly, the questions of the survey questionnaire must be coded. The first line of the excel should mention the question number and the sub-options under it. So, based on the number of questionnaires included in the study, the question number needs to be edited while the sub-options could be changed as per the number of statements under each question.

For example; the entire questionnaire, if consists of 10 questions, then the question number in the code will start with ‘a’ and end with ‘j’. Among these 10 questions let’s say 1st question is about Gender with ‘Male’ and ‘Female’ as sub-statements then question [‘a’] would have a value of 1 for male while 2 for female.

### Step 2: Allotment of values to the sub-options

Next, decide the frequency of responses to every option in every question. For example; in a study examining the perception of luxury cosmetics with a sample of 100, the maximum number of consumers that should be included in the study would be female. Thus, we can allot 82 numbers to females and 18 to males.

### Step 3: Run a PHP script to generate sample data to code

``````<?php

//Enter the number of respondents inside with "A","B"... as the choices
//And after "=>" set the frequency of those choices

\$Question["a"] = array(
"6" => 4,
"5" => 6,
"4" => 18,
"3" => 10,
"2" => 4,
"1" => 8
);

\$Question["b"] = array(
"6" => 4,
"5" => 5,
"4" => 6,
"3" => 16,
"2" => 12,
"1" => 7
);

\$Question["c"] = array(
"4" => 4,
"3" => 15,
"2" => 19,
"1" => 12
);

\$Question["d"] = array(
"4" => 0,
"3" => 0,
"2" => 22,
"1" => 28
);

\$Question["e"] = array(
"2" => 12,
"1" => 38
);

\$Question["f"] = array(
"3" => 12,
"2" => 35,
"1" => 3
);
//Do not change anything below

foreach (\$Question as \$col => \$options){

\$column_name[]	= \$col;

\$tot_row[]  	= array_sum(\$options);

foreach(\$options as \$selection => \$row_value){

for(\$i=0;\$i<\$row_value;\$i++){

\$row[] = \$selection;
}

if(empty(\$row))

continue;

shuffle(\$row);

\$value[\$col] = \$row;
}

unset(\$row);
}

\$high = max(\$tot_row);

for(\$i=0;\$i<\$high;\$i++){

echo "<tr>";

foreach(\$column_name as \$d){

if(isset(\$value[\$d][\$i])){

echo "<td>".\$value[\$d][\$i]."</td>";

}else{

echo "<td></td>";
}
}

echo "</tr>";
}

echo "</table>";
?>
``````

After assigning the codes, paste and run them in the link http://www.writephponline.com/.

### Step 4: Create the datasheet

The link will create the dataset. Next, copy and paste it on an empty excel sheet by right-clicking on an open cell and then selecting ‘Paste special’ followed by ‘Unicode text’ on the Excel. This is the final datasheet that can be used for further analysis.

Other than editing question numbers and sub-options, no other changes should be made to the code.

## Primary data coding random the responses generated by PHP script

Let’s say a study needs to be done for understanding the influence of a remote working environment on employee performance in IT companies. For this, the sample selected for the study is 100. Some of the questions from a close-ended questionnaire are:

The procedure of data coding is shown below.

``````\$Question["a"] = array(
"6" => 4,
"5" => 6,
"4" => 18,
"3" => 10,
"2" => 4,
"1" => 8
);

\$Question["b"] = array(
"6" => 4,
"5" => 5,
"4" => 6,
"3" => 16,
"2" => 12,
"1" => 7
);

\$Question["c"] = array(
"4" => 4,
"3" => 15,
"2" => 19,
"1" => 12
);``````

The frequency mentioned above has three questions. Therefore the code will include only Questions a, b, and c as shown in the figure above. Similarly, each question’s options must be adjusted per the questionnaire. For example, in table 1 above, Q.1 has 5 options. Therefore Q.a would also have only 5 options. Q.2. has 2 options thus Q.b. would have 2 options.

The next step is to decide the frequency of each option. For example, there are likely to be more male employees in the present case of IT companies than females. Similarly, most employees are likely to be young. Therefore the frequency will be decided accordingly (more responses for the ‘male= 1’ and ’20-30 years = 2’ age groups. This is shown in the figure below.

Next, paste the code on www.writephponline.com/ and click on ‘Run code’ as shown below.

Results derived from running the code are shown below.

Copy the results shown in Figure 6 and paste them into an empty excel sheet by right-clicking on any empty cell, selecting ‘Paste Special’ and then ‘Unicode Text’. This step is shown in figures 6 and 7 below.

The final datasheet will look as below.

This procedure can be followed only for close-ended questionnaires. It is important to note that each statement in every section, i.e. demographic, general background and inferential section would be treated as individual questions. For example, in the inferential section:

In the above figure, both statements data would be collected individually representing them as 2 questions.

Priya is the co-founder and Managing Partner of Project Guru, a research and analytics firm based in Gurgaon. She is responsible for the human resource planning and operations functions. Her expertise in analytics has been used in a number of service-based industries like education and financial services.

Her foundational educational is from St. Xaviers High School (Mumbai). She also holds MBA degree in Marketing and Finance from the Indian Institute of Planning and Management, Delhi (2008).

Some of the notable projects she has worked on include:

• Using systems thinking to improve sustainability in operations: A study carried out in Malaysia in partnership with Universiti Kuala Lumpur.
• Assessing customer satisfaction with in-house doctors of Jiva Ayurveda (a project executed for the company)
• Predicting the potential impact of green hydrogen microgirds (A project executed for the Government of South Africa)

She is a key contributor to the in-house research platform Knowledge Tank.

She currently holds over 300 citations from her contributions to the platform.

She has also been a guest speaker at various institutes such as JIMS (Delhi), BPIT (Delhi), and SVU (Tirupati).

I am a master's in Economics from Amity University. Having a keen interest in Econometrics and data analysis, I was a part of the Innovation Project of Daulat Ram College, Delhi University. My core expertise and interest are in environment-related issues. Apart from academics, I love music and exploring new places.