Primary data coding procedure for statistical analysis

By Riya Jain & Priya Chetty on July 12, 2021

Research can be based on primary data or secondary data. Though secondary data is easily available from published sources or official websites, primary data is based on respondents’ perceptions and thus has a chance of bias. It is essential to avoid biases and errors in data collection. The data needs to be random as it helps derive precise results in statistical analysis. The first step in the process of statistical analysis is primary data coding. This refers to the coding of responses obtained in the survey. Primary data coding has to be:

Random
Accurately coded
Elimination or reduction of missing values

Data coding randomly generated responses

Primary data coding in MS Excel

Typically, the primary data is coded in MS Excel. This is because Excel is versatile, simple and easy to be imported into other statistical applications for further analysis.

The following steps should be followed for preparing the datasheet.

Figure 1: Primary data collection procedure

Step 1: Defining the number of questions and sub-options in the code

Firstly, the questions of the survey questionnaire must be coded. The first line of the excel should mention the question number and the sub-options under it. So, based on the number of questionnaires included in the study, the question number needs to be edited while the sub-options could be changed as per the number of statements under each question.

For example; the entire questionnaire, if consists of 10 questions, then the question number in the code will start with ‘a’ and end with ‘j’. Among these 10 questions let’s say 1^st question is about Gender with ‘Male’ and ‘Female’ as sub-statements then question [‘a’] would have a value of 1 for male while 2 for female.

Step 2: Allotment of values to the sub-options

Next, decide the frequency of responses to every option in every question. For example; in a study examining the perception of luxury cosmetics with a sample of 100, the maximum number of consumers that should be included in the study would be female. Thus, we can allot 82 numbers to females and 18 to males.

Step 3: Run a PHP script to generate sample data to code

<?php

//Enter the number of respondents inside with "A","B"... as the choices
//And after "=>" set the frequency of those choices

$Question["a"] = array(
				"6" => 4,
				"5" => 6, 
				"4" => 18,
				"3" => 10,
				"2" => 4,
				"1" => 8
				);


$Question["b"] = array(	
                "6" => 4,
				"5" => 5,				
				"4" => 6,
				"3" => 16,
				"2" => 12,
				"1" => 7
				);

$Question["c"] = array(	
				"4" => 4,
				"3" => 15,
				"2" => 19,
				"1" => 12
				);

$Question["d"] = array(
				"4" => 0,
				"3" => 0,
				"2" => 22,
				"1" => 28
				);

$Question["e"] = array(	            	           
				"2" => 12,
				"1" => 38	            
				);

$Question["f"] = array(	            	            	            
				"3" => 12,
				"2" => 35,
				"1" => 3            
				);
//Do not change anything below

foreach ($Question as $col => $options){
	
	$column_name[]	= $col;
	
	$tot_row[]  	= array_sum($options);

	foreach($options as $selection => $row_value){
		
		for($i=0;$i<$row_value;$i++){
		
			$row[] = $selection;
		}
		
		if(empty($row))
			
			continue;
		
		shuffle($row);
		
		$value[$col] = $row;
	}
	
	unset($row);
}

$high = max($tot_row);

echo "<table border='0' cellpadding='0'><tr>";
		
for($i=0;$i<$high;$i++){

			
	echo "<tr>";
		
	foreach($column_name as $d){
			
		if(isset($value[$d][$i])){
				
			echo "<td>".$value[$d][$i]."</td>";
			
		}else{
				
			echo "<td></td>";
		}
	}
			
	echo "</tr>";
}
	
echo "</table>";
?>

After assigning the codes, paste and run them in the link http://www.writephponline.com/.

Step 4: Create the datasheet

The link will create the dataset. Next, copy and paste it on an empty excel sheet by right-clicking on an open cell and then selecting ‘Paste special’ followed by ‘Unicode text’ on the Excel. This is the final datasheet that can be used for further analysis.

Other than editing question numbers and sub-options, no other changes should be made to the code.

Primary data coding random the responses generated by PHP script

Let’s say a study needs to be done for understanding the influence of a remote working environment on employee performance in IT companies. For this, the sample selected for the study is 100. Some of the questions from a close-ended questionnaire are:

Figure 1: Demographic questions of a sample survey questionnaire

The procedure of data coding is shown below.

$Question["a"] = array(
				"6" => 4,
				"5" => 6, 
				"4" => 18,
				"3" => 10,
				"2" => 4,
				"1" => 8
				);


$Question["b"] = array(	
                                "6" => 4,
				"5" => 5,				
				"4" => 6,
				"3" => 16,
				"2" => 12,
				"1" => 7
				);

$Question["c"] = array(	
				"4" => 4,
				"3" => 15,
				"2" => 19,
				"1" => 12
				);

The frequency mentioned above has three questions. Therefore the code will include only Questions a, b, and c as shown in the figure above. Similarly, each question’s options must be adjusted per the questionnaire. For example, in table 1 above, Q.1 has 5 options. Therefore Q.a would also have only 5 options. Q.2. has 2 options thus Q.b. would have 2 options.

The next step is to decide the frequency of each option. For example, there are likely to be more male employees in the present case of IT companies than females. Similarly, most employees are likely to be young. Therefore the frequency will be decided accordingly (more responses for the ‘male= 1’ and ’20-30 years = 2’ age groups. This is shown in the figure below.

Figure 3: Deciding the frequency of each option in the questions for primary data coding

Next, paste the code on www.writephponline.com/ and click on ‘Run code’ as shown below.

Figure 4: Running the code on the PHP link

Results derived from running the code are shown below.

Copy the results shown in Figure 6 and paste them into an empty excel sheet by right-clicking on any empty cell, selecting ‘Paste Special’ and then ‘Unicode Text’. This step is shown in figures 6 and 7 below.

Figure 7: ‘Unicode text’ command for pasting PHP data in Excel

The final datasheet will look as below.

This procedure can be followed only for close-ended questionnaires. It is important to note that each statement in every section, i.e. demographic, general background and inferential section would be treated as individual questions. For example, in the inferential section:

Figure 9: Independent statements 5-point Likert scale

In the above figure, both statements data would be collected individually representing them as 2 questions.