# Primary data coding procedure for statistical analysis

By Riya Jain & Priya Chetty on July 12, 2021

Research can be based on primary data or secondary data. Though secondary data is easily available from published sources or official websites, primary data is based on respondents’ perceptions and thus has chances of bias. It is essential to avoid biases and errors in data collection. The data needs to be random as it helps derive precise results in statistical analysis. The first step in the process of statistical analysis is primary data coding. This refers to the coding of responses obtained in the survey. Primary data coding has to be:

• Random
• Accurately coded
• Elimination or reduction of missing values

## Primary data coding in MS Excel

Typically, the primary data is coded in MS Excel. This is because Excel is versatile, simple and easy to be imported into other statistical applications for further analysis.

The following steps should be followed for preparing the datasheet.

### Step 1: Defining the number of questions and sub-options in code

Firstly, the questions of the survey questionnaire must be coded. The first line of the excel should mention the question number and the sub-options under it. So, based on the number of questionnaires included in the study, the question number need to be edited while the sub-options could be changed as per the number of statements under each question.

For example; the entire questionnaire, if consists of 10 questions, then the question number in code will start from ‘a’ and end at ‘j’. Among these 10 questions let say 1st question is about Gender with ‘Male’ and ‘Female’ as sub-statements then question [‘a’] would have value 1 for male while 2 for female.

### Step 2: Allotment of values to the sub-options

Next, decide the frequency of responses to every option in every question. For example; in a study examining the perception of luxury cosmetics with a sample of 100, the maximum consumers that should be included in the study would be female. Thus, we can allot 82 numbers to females and 18 to males.

### Step 3: Run a php script to generate sample data to code

``````<?php

//Enter the number of respondents inside with "A","B"... as the choices
//And after "=>" set the frequency of those choices

\$Question["a"] = array(
"6" => 4,
"5" => 6,
"4" => 18,
"3" => 10,
"2" => 4,
"1" => 8
);

\$Question["b"] = array(
"6" => 4,
"5" => 5,
"4" => 6,
"3" => 16,
"2" => 12,
"1" => 7
);

\$Question["c"] = array(
"4" => 4,
"3" => 15,
"2" => 19,
"1" => 12
);

\$Question["d"] = array(
"4" => 0,
"3" => 0,
"2" => 22,
"1" => 28
);

\$Question["e"] = array(
"2" => 12,
"1" => 38
);

\$Question["f"] = array(
"3" => 12,
"2" => 35,
"1" => 3
);
//Do not change anything below

foreach (\$Question as \$col => \$options){

\$column_name[]	= \$col;

\$tot_row[]  	= array_sum(\$options);

foreach(\$options as \$selection => \$row_value){

for(\$i=0;\$i<\$row_value;\$i++){

\$row[] = \$selection;
}

if(empty(\$row))

continue;

shuffle(\$row);

\$value[\$col] = \$row;
}

unset(\$row);
}

\$high = max(\$tot_row);

echo "<table border='0' cellpadding='0'><tr>";

for(\$i=0;\$i<\$high;\$i++){

echo "<tr>";

foreach(\$column_name as \$d){

if(isset(\$value[\$d][\$i])){

echo "<td>".\$value[\$d][\$i]."</td>";

}else{

echo "<td></td>";
}
}

echo "</tr>";
}

echo "</table>";
?>
``````

After assigning the codes, paste and run it in the link http://www.writephponline.com/.

### Step 4: Create the datasheet

The link will create the dataset. Next, copy it and paste it on an empty excel sheet by right-clicking on an empty cell then selecting ‘Paste special’ followed by ‘Unicode text’ on the Excel. This is the final datasheet that can be used for further analysis.

Other than editing question numbers and sub-options, no other changes should be made to the code.

## Primary data coding random the responses generated by PHP script

Let’s say a study needs to be done for understanding the influence of a remote working environment on employee performance in IT companies. For this, the sample selected for the study is 100. Some of the questions from a close-ended questionnaire are: Figure 1: Demographic questions of a sample survey questionnaire

The procedure of data coding is shown below.

``````\$Question["a"] = array(
"6" => 4,
"5" => 6,
"4" => 18,
"3" => 10,
"2" => 4,
"1" => 8
);

\$Question["b"] = array(
"6" => 4,
"5" => 5,
"4" => 6,
"3" => 16,
"2" => 12,
"1" => 7
);

\$Question["c"] = array(
"4" => 4,
"3" => 15,
"2" => 19,
"1" => 12
);``````

The frequency mentioned above has three questions. Therefore the code will include only Question a, b, and c as shown in the figure above. Similarly, the options of each question must be adjusted as per the questionnaire. For example, in table 1 above, Q.1 has 5 options. Therefore Q.a would also have only 5 options. Q.2. has 2 options thus Q.b. would have 2 options.

The next step is to decide the frequency of each option. For example, there are likely to be more male employees in the present case of IT companies than females. Similarly, most employees are likely to be young. Therefore the frequency will be decided accordingly (more responses for ‘male= 1’ and ’20-30 years = 2’ age group. This is shown in the figure below. Figure 3: Deciding the frequency of each option in the questions for primary data coding

Next, paste the code on http://www.writephponline.com/ and click on ‘Run code’ as shown below.

Results derived from running the code is shown below.

Copy the results shown in Figure 6 and paste into an empty excel sheet by right-clicking on any empty cell, select ‘Paste Special’ and then ‘Unicode Text’. This step is shown in figures 6 and 7 below.