# Basic terms of statistics

SPSS, now expanded as “Statistical Product and Service Solutions” (originally “Statistical Package for the Social Sciences”), was first developed by Norman H. Nie, Dale H. Bent and C. Hadlai Hull in 1968. The software is used to conduct statistical analysis on a sample drawn from a larger population. Some of the terms used in statistics for sample analysis in this software are explained below:

## The mean (average)

The mean is the average of all values in a particular column. It is generally represented by µ (Babraham Bioinformatics, n.d.).

$$\mu = \frac{\sum X}{N}$$

(Where ∑ = sum of, X = individual data points, and N = sample size)

Also known as the ‘average’, it is the most common statistical tool by which researchers estimate the population mean when conducting quantitative analysis of a sample. Since the mean is influenced by outliers (values that are very small or very large compared with the rest), it may not be a fair representation of the data. For example, suppose we calculate the mean of the English marks of students in a sample of 100 respondents. Even if most of the students have earned nearly 60% marks, the mean would still be pulled by the high scorers (more than 80%) and the low scorers (less than 20%), which would affect the overall mean value.
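The effect of outliers on the mean can be sketched in a few lines of Python. The marks below are hypothetical values made up for illustration, and the standard-library `statistics` module stands in for the calculation SPSS would perform.

```python
from statistics import mean

# Hypothetical English marks (percent) for ten students, most scoring near 60.
marks = [58, 60, 61, 59, 62, 60, 57, 63, 60, 60]
typical_mean = mean(marks)

# Add two very low scores (outliers) and recompute the mean.
marks_with_outliers = marks + [5, 10]
shifted_mean = mean(marks_with_outliers)

print(typical_mean)  # 60
print(shifted_mean)  # 51.25
```

Two extreme values out of twelve are enough to move the mean well away from the mark that most students actually earned.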

## The median

The median is the numerical value which separates the higher half of the sample from the lower half. In layman's terms, it is the middle value in the sample or probability distribution (Weisstein, n.d.). The median is generally applied in situations where the researcher cannot obtain precise measurements and therefore ranks the data in order. For example, when ranking the performance of students in a class, the student who is in the middle of the ranking would represent the median performance of the class.
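As a minimal sketch, the median can be found by sorting the values and taking the middle one. The scores below are hypothetical, and `statistics.median` from the Python standard library is used for the calculation.

```python
from statistics import median

# Hypothetical test scores for five students (odd number of values).
scores = [35, 72, 48, 90, 61]
middle_score = median(scores)  # sorted: 35, 48, 61, 72, 90

# With an even number of values, the median is the average of the two middle values.
scores_even = [35, 72, 48, 90, 61, 55]
middle_score_even = median(scores_even)  # sorted: 35, 48, 55, 61, 72, 90

print(middle_score)       # 61
print(middle_score_even)  # 58.0
```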

## The variance

Variance is the measure of how far the numbers in the sample are spread out. For example, if the variance in the sample is zero, then all the values are identical. For illustration, suppose a sample of 100 respondents was asked which chocolate brand they like the most. If all the respondents chose the same brand, there is no variance; if some respondents chose other brands as well, the responses show variance. A small variance indicates that the values are close to the mean (Weisstein, n.d.).

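$$S^2 = \frac{\sum (X - \mu)^2}{N - 1}$$

(Where S² = variance, X = individual data points, µ = mean of the data points, and N = sample size)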

When we see variability in sample data, there are reasons behind this variability. For example, when a sample population is asked why they prefer to shop at a particular store, the respondents may give either varied or similar reasons. This enables the researcher to present findings related to that sample, on the basis of which they can suggest recommendations to stores which have low sales. Therefore, to analyse variability, the researcher needs to find out whether something important has happened. Variance allows the researcher to answer these questions.
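The sample variance divides the sum of squared deviations from the mean by the sample size minus one. The sketch below computes it by hand on hypothetical data and compares the result with `statistics.variance` from the Python standard library; the helper name `sample_variance` is introduced only for this illustration.

```python
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    centre = mean(values)
    return sum((x - centre) ** 2 for x in values) / (len(values) - 1)

# Hypothetical responses: identical values versus values that differ.
identical = [4, 4, 4, 4]
spread = [2, 4, 4, 6]

variance_identical = sample_variance(identical)
variance_spread = sample_variance(spread)

print(variance_identical)  # zero, because every value equals the mean
print(variance_spread)     # 8 / 3, roughly 2.67
print(variance(spread))    # library result for comparison
```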

## The standard deviation

The square root of the variance is called the standard deviation. It is denoted by SD (Weisstein, n.d.). It measures the variability in the sample and describes how the rest of the data relate to the mean. If the responses given by the sample are close to the mean, the data are uniform and the standard deviation is small; if the responses are far from the mean, the standard deviation is large. If all the values are the same, the standard deviation is zero. The standard deviation can be calculated using the following formula.

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

Where S = standard deviation, ∑ = sum of, X = each value in the data set, X̄ = mean of all values in the data set, and n = number of values in the data set.

Standard deviation is also used to compare two sets of data effectively. For example, data set 1 contains 1, 3, 5 and data set 2 contains 0, 3, 6. The mean of the two data sets is the same (3), but the standard deviations differ: s(1) = 2 and s(2) = 3. Without the standard deviation, the researcher cannot tell how closely the data cluster around the mean.
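The comparison of the two data sets can be checked with a short Python sketch. It uses `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1); the variable names are chosen only for this illustration.

```python
from statistics import mean, stdev

set_1 = [1, 3, 5]
set_2 = [0, 3, 6]

# Both data sets share the same mean.
mean_1, mean_2 = mean(set_1), mean(set_2)

# The sample standard deviation separates them.
sd_1, sd_2 = stdev(set_1), stdev(set_2)

print(mean_1, mean_2)  # 3 3
print(sd_1, sd_2)      # 2.0 3.0
```

The identical means hide the fact that the second data set is more widely spread, which is exactly what the standard deviation reveals.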

## Confidence interval

The confidence interval quantifies the uncertainty in the measurement. When the mean of a sample is calculated, it may not equal the true population mean. The size of the discrepancy depends on the variability of the values (the change in responses among the respondents) and on the sample size (denoted N, the part of the larger population that was sampled). Therefore, one has to combine these two in order to calculate a 95% or 98% confidence interval. A 95% confidence interval reflects a 5% risk of being wrong, and a 98% confidence interval reflects a 2% risk of being wrong. The interval is expected to contain the true population mean (Weisstein, n.d.).
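One common way to combine variability and sample size is the standard error, the standard deviation divided by the square root of N. The sketch below builds a confidence interval for the mean from it, assuming a normal approximation; for small samples a critical value from the t-distribution is usually preferred, so this is an illustration rather than the procedure SPSS follows. The function name `mean_confidence_interval` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)  # variability combined with sample size
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return centre - margin, centre + margin

# Hypothetical marks for 12 respondents.
marks = [58, 60, 61, 59, 62, 60, 57, 63, 60, 60, 55, 65]

ci_95 = mean_confidence_interval(marks, 0.95)
ci_98 = mean_confidence_interval(marks, 0.98)

print(ci_95)
print(ci_98)  # wider than the 95% interval
```

Asking for a lower risk of being wrong (2% instead of 5%) widens the interval, and a larger sample narrows it.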

## The mode

The mode is the value in a distribution with the highest frequency. It can be calculated even for nominal data (Argyrous, 2013). The mode is normally used for categorical data, where the researcher wants to find the most common category.
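A minimal Python sketch of the mode on categorical data is shown below. The brand labels are hypothetical; `statistics.mode` returns the most frequent value, and `statistics.multimode` returns every value tied for the highest frequency.

```python
from statistics import mode, multimode

# Hypothetical nominal data: preferred chocolate brand of six respondents.
brands = ["A", "B", "A", "C", "A", "B"]
most_common_brand = mode(brands)

# When two categories tie, multimode lists both in order of first appearance.
tied_brands = ["A", "B", "A", "B", "C"]
most_common_tied = multimode(tied_brands)

print(most_common_brand)  # A
print(most_common_tied)   # ['A', 'B']
```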

## Range

The range of a set of data is the difference between the largest and the smallest values (Woodbury, 2001). It is also used to describe the variability in a sample or population. It helps in understanding how well the mean can represent the data. It is also useful when the studied variable has a critical low or high threshold, because the minimum and maximum show whether that threshold has been crossed. For example, if the researcher is conducting an experiment to study the average mileage of 10 cars, the maximum value would be the highest mileage, the minimum value would be the lowest mileage, and the range would be the difference between the two.
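The mileage example can be sketched in Python as follows. The ten mileage figures are hypothetical and the unit is left unspecified.

```python
# Hypothetical mileage readings for 10 cars.
mileage = [14, 18, 16, 21, 15, 17, 19, 13, 20, 16]

highest = max(mileage)
lowest = min(mileage)
mileage_range = highest - lowest

print(highest, lowest)  # 21 13
print(mileage_range)    # 8
```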

## Interquartile range (IQR)

The interquartile range is a measure of statistical variation and is equal to the difference between the upper and the lower quartiles, where the quartiles divide the sample into 4 equal parts (Upton and Cook, 1996). The first or lower quartile (Q1) is the middle number between the first value and the median, and the third or upper quartile (Q3) is the middle number between the median and the last value. The median is also called the second quartile (Q2).

It is similar to the range, except that it covers only the middle half of the data, which makes it well suited to large data sets. It is also significant because it is not sensitive to outliers, which affect measures such as the mean and the range discussed above.

$$\text{IQR} = Q_3 - Q_1$$
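The quartile definitions above can be sketched in Python by taking the median of the lower half and of the upper half of the sorted data. Several quartile conventions exist, so statistical packages may report slightly different values; the helper `quartiles` and the data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd number of values, the median itself is left out of both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical data with one extreme value (100).
data = [1, 3, 5, 7, 9, 11, 13, 100]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q2, q3)  # 4.0 8.0 12.0
print(iqr)         # 8.0
print(data_range)  # 99
```

The single outlier inflates the range to 99, while the interquartile range stays at 8.0 because it only looks at the middle half of the data.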


### Priya Chetty

Partner at Project Guru
Priya holds a master's degree in business administration with majors in marketing and finance. She is fluent in data modelling, time series analysis, various regression models, forecasting and interpretation of data. She has assisted data scientists, corporates and scholars in the fields of finance, banking, economics and marketing.