Common pipeline for statistical analysis in epidemiological studies

By Avishek Majumder & Chandrika Kapagunta on October 1, 2017

Previous articles discussed the need for statistical analysis and modelling in epidemiological and public health studies. Analysis in epidemiological studies typically requires both descriptive and analytical methods. Descriptive analysis examines how disease outcome rates vary with demographic variables, which supports hypothesis development. Analytical epidemiology, on the other hand, determines the cause or mode of a disease outbreak. In general, the two are undertaken together, because the results of descriptive analysis offer clues for developing and testing hypotheses in analytical studies. This article explains the common steps in the data analysis of epidemiological studies, as well as the statistical models used to study factors associated with epidemic diseases.

Steps in conducting statistical analysis in epidemiology

There are several common steps in conducting modelling studies of diseases, which are modified according to the study context or data. These steps include a descriptive analysis of the data, hypothesis development, model selection and hypothesis testing (Dicker, Coronado, Koo, & Parrish, 2012). The steps are interlinked, and each one influences the next. Before any analysis, the first task is to establish the existence of an outbreak and verify its diagnosis, which confirms the correct nature of the disease. An outbreak or epidemic is defined as the occurrence of cases of a disease in excess of what would normally be expected in a defined community, geographical area or season (WHO 2017). Once the outbreak is confirmed, researchers collect data through active or passive biosurveillance. This involves collecting several types of information, presented in Figure 1 below.

Types of information collected during biosurveillance. (Source: Dicker, Coronado, Koo, & Parrish, 2012)

After data collection, the next step is to perform data analysis. This step helps answer several questions about the disease: who, where, when, what, how and why. It also seeks to understand the agent or pathogen, the hosts affected and the associated factors. The four main steps in the analysis of epidemiological data are (Dicker et al. 2012):

Steps in Epidemiological Investigation (Source: Dicker et al., 2012)

Step 1: Descriptive epidemiology

In this step, the aim is to describe the characteristics and trends of the epidemic in terms of person, place and time. This is referred to as the ‘epidemiological triad for descriptive epidemiology’. The step includes summarizing the data on the basis of these variables. It reveals the following:

  1. Trends of disease prevalence and spread over time.
  2. Geographical distribution of the disease.
  3. Populations, communities or groups affected by the disease.

A major outcome of this step is identifying the population at risk of the disease. Knowledge of all these aspects helps in developing a hypothesis about the cause or mode of a disease outbreak. Common statistical analyses include:

  • measures of central tendency and spread,
  • trend analysis,
  • clustering analysis,
  • and association measures.
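As a minimal sketch of two of these analyses, the Python snippet below summarizes the central tendency and spread of case ages and computes two common association measures (risk ratio and odds ratio) from a 2x2 exposure table. All numbers and variable names are hypothetical and chosen only for illustration; they do not come from any study cited here.

```python
import statistics

# Hypothetical line list: ages of confirmed cases (illustrative values only).
case_ages = [4, 7, 9, 12, 15, 21, 23, 35, 41, 63]

# Measures of central tendency and spread.
mean_age = statistics.mean(case_ages)
median_age = statistics.median(case_ages)
sd_age = statistics.stdev(case_ages)  # sample standard deviation

# Hypothetical 2x2 table: exposure status against disease status.
exposed_ill, exposed_well = 30, 70
unexposed_ill, unexposed_well = 10, 90

# Association measures: compare risk (or odds) of illness between groups.
risk_exposed = exposed_ill / (exposed_ill + exposed_well)
risk_unexposed = unexposed_ill / (unexposed_ill + unexposed_well)
risk_ratio = risk_exposed / risk_unexposed
odds_ratio = (exposed_ill * unexposed_well) / (exposed_well * unexposed_ill)

print(f"mean age {mean_age}, median age {median_age}, SD {sd_age:.1f}")
print(f"risk ratio {risk_ratio:.2f}, odds ratio {odds_ratio:.2f}")
```

A risk ratio or odds ratio well above 1 in such a table would point to the exposed group as a candidate at-risk population, which feeds into the hypothesis developed in the next step.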

Step 2: Hypothesis development

Based on the data summary of trends in time, place and person of a disease, a hypothesis is developed. It addresses any or all of the following questions:

  1. What is the causative agent of the disease (pathogen)?
  2. What is the mode of transmission of the disease (vehicle or vector)?
  3. Which exposures caused the disease outbreak?

The hypothesis is developed depending upon the background information of the disease, with respect to the host, agent and environment. It is referred to as ‘epidemiological triad for analytical epidemiology’.

Step 3: Analytical test or model selection

Depending on the data characteristics, included variables and the objective or hypothesis, a suitable mathematical or statistical test is selected. It could be a simple regression model to test relationships between associated factors. It can also be a forecasting model that can predict future events based on past trends of the disease.

Step 4: Hypothesis testing

Hypothesis testing draws on environmental, laboratory or epidemiological data, or a combination of all three. Mainly, a hypothesis is tested by:

  1. Comparing the hypothesis with established facts.
  2. Using analytical epidemiology to quantify relationships and assess the role of chance.

A statistical test enables comparison of past data and the identification of patterns in data. This includes comparing observed patterns (of affected cases) to expected patterns (among unaffected cases).
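One common way to assess the role of chance is a chi-square test on a 2x2 table, which compares the observed counts with the counts expected if exposure and illness were unrelated. The sketch below is an illustration under that assumption, not the method prescribed by the cited sources; the function name `chi_square_2x2` and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    a, b: ill / not ill among the exposed.
    c, d: ill / not ill among the unexposed.
    Returns the chi-square statistic and its p-value (1 degree of freedom).
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, a + b, c + d, c + d]
    col_totals = [a + c, b + d, a + c, b + d]
    # Expected count in each cell if exposure and illness were independent.
    expected = [r * k / n for r, k in zip(row_totals, col_totals)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed fell ill.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

For these illustrative counts the statistic is 12.5 and the p-value is about 0.0004, so a difference this large would rarely arise by chance alone.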

Another type of analysis is the prediction or forecasting of future outcomes in terms of host mortality or morbidity (Hufnagel et al. 2004). It is not part of the common pipeline but is widely used in epidemiological studies. Forecasts can be based on past incidents of disease occurrence or outcomes within a population. They are instrumental in the development of warning systems and public health strategies, because information on future epidemics helps in implementing preventive measures and treatment options (Soyiri & Reidpath 2013). Time series forecasting models are crucial here. They can include only past incidents (univariate models) or also associated factors such as climate, vector parameters and host behaviours (multivariate models) (Newman 2003). Later articles discuss the conceptual basis of these forecasting models and present a systematic review of the prominent models used in forecasting.
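To make the idea of a univariate forecast concrete, the sketch below applies simple exponential smoothing to a short series of weekly case counts. This is only one of many possible time series methods and is chosen here for brevity; the function name, the smoothing weight `alpha` and the counts are assumptions made for illustration.

```python
def exponential_smoothing_forecast(counts, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    counts: past case counts in time order.
    alpha: weight given to the newest observation (0 < alpha <= 1).
    """
    if not counts:
        raise ValueError("counts must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = counts[0]
    for observed in counts[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * observed + (1 - alpha) * level
    return level


# Hypothetical weekly case counts for one district.
weekly_cases = [10, 12, 16, 20]
forecast = exponential_smoothing_forecast(weekly_cases, alpha=0.5)
print(f"forecast for next week: {forecast:.2f}")
```

With these counts the forecast is 16.75. Because the method uses only past counts, it is a univariate model; a multivariate model would add covariates such as rainfall or vector density.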

Consequently, such an investigation provides answers related to the disease and its outbreak. Moreover, the analysis explains the causal agents and associated factors increasing risks in certain populations. Lastly, it is possible to predict and hence prevent or control future events of the disease.

Regression in epidemiology

Regression models are very useful in epidemiological studies. They help to explore relationships between several variables of a disease in a community (Suarez et al. 2017). For instance, they can quantify the association between a disease and the factors that may contribute to an epidemic. These models are based on the assumption that variables associated with a disease influence each other. Malaria in Ethiopia is a good example: research using regression models determined that meteorological factors like rainfall and temperature affected the seasonal transmission of the disease (Alemu et al. 2011). Regression models also help estimate the mortality and morbidity rates of an epidemic. For instance, mortality rates of the 2009 influenza pandemic were assessed using the Serfling regression model, a modification of seasonal regression (Viboud et al. 2010).

In regression models, two types of variables are used. The variable of interest is the dependent variable, and the associated variables are the independent variables. Examples of dependent variables are disease outcomes like mortality, morbidity or disease risk. Examples of independent variables are the demographic profile of hosts, temperature, humidity and vector prevalence. There are four main types of regression models applicable in epidemiological studies. The choice depends upon the characteristics of the dependent variable (DP) (Bonita et al. 2006; Szklo & Nieto 2014).

  1. Linear Regression models (DP is continuous data).
  2. Logistic Regression models (DP is dichotomous data like Yes or No).
  3. Cox Proportional Hazards models (DP is the time from baseline or non-event to an event).
  4. Poisson Regression models (DP is the incidence rate based on person-time).

Applicability of different regression models in the epidemiological investigation

Linear models are useful when the dependent variable is continuous data, like body weight in the case of an obesity epidemic. They also reveal the relationship between different factors like urbanization and behaviours. This model assumes that the risk of a disease changes with risk factors in a linear fashion (Jewell 2009).
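As an illustration of the linear case, the sketch below fits a straight line by ordinary least squares to a continuous outcome. The data (body weight against weekly hours of physical activity) and the function name are hypothetical and serve only to show the mechanics.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: weekly hours of physical activity and body weight (kg).
activity_hours = [0, 2, 4, 6, 8]
body_weight_kg = [90.0, 88.5, 87.5, 85.5, 84.5]

intercept, slope = fit_simple_linear_regression(activity_hours, body_weight_kg)
print(f"weight = {intercept:.1f} + ({slope:.2f}) * hours")
```

Here the fitted slope is -0.7, meaning each additional weekly hour of activity is associated with 0.7 kg lower body weight in this made-up sample.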

Logistic regression models, on the other hand, are useful when the outcomes of diseases are binary (yes or no). Examples include death (dead or alive), disease (affected or unaffected) and recovery (recovered or not recovered). Logistic regression resembles linear regression, except that the linear combination of covariates models the log-odds of the event occurring; exponentiating a coefficient gives an odds ratio.
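The link between log-odds, probabilities and odds ratios can be sketched for the simplest case of one binary exposure. In that case the maximum-likelihood coefficients have a closed form that can be read off a 2x2 table, so no iterative fitting is needed. The counts and variable names below are hypothetical.

```python
import math


def logistic(z):
    """Convert log-odds z into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# Hypothetical 2x2 table: illness by exposure status.
exposed_ill, exposed_well = 30, 70
unexposed_ill, unexposed_well = 10, 90

# Model: log-odds(ill) = intercept + log_odds_ratio * exposed (0 or 1).
# With a single binary covariate, the fitted coefficients equal these values.
intercept = math.log(unexposed_ill / unexposed_well)
log_odds_ratio = math.log(
    (exposed_ill / exposed_well) / (unexposed_ill / unexposed_well)
)
odds_ratio = math.exp(log_odds_ratio)

# Predicted probabilities reproduce the observed risk in each group.
p_unexposed = logistic(intercept)
p_exposed = logistic(intercept + log_odds_ratio)
print(f"odds ratio {odds_ratio:.2f}, P(ill | unexposed) {p_unexposed:.2f}, "
      f"P(ill | exposed) {p_exposed:.2f}")
```

With several or continuous covariates the coefficients no longer have a closed form and are estimated iteratively by statistical software, but they are interpreted in the same way.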

Cox Proportional Hazards models are relevant for survival or time-to-event data. In survival data, the outcome is the waiting time until an event occurs. In epidemiology, these models are useful when exposure to a risk multiplies the hazard of the disease outcome by a constant factor (the hazard ratio) (Szklo & Nieto 2014).
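The constant-factor assumption can be written out in the standard form of the model. The symbols are introduced here for illustration and do not appear in the cited sources as quoted: h_0(t) is the baseline hazard, x_1 to x_p are the covariates and β_1 to β_p their coefficients.

```latex
h(t \mid x_1, \dots, x_p) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p),
\qquad
\mathrm{HR}_j = \frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \exp(\beta_j)
```

Because the baseline hazard cancels in the ratio, the hazard ratio for a one-unit increase in a covariate does not depend on time, which is what "proportional hazards" refers to.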

Poisson or log-linear regression models are used when the outcome of a disease is expressed as rates (or rate ratios). This refers to the rates of outcomes within a population, as in the case of rare diseases (Szklo & Nieto 2014). The model assumes that the logarithm of the outcome rate is a linear combination of covariates, so the rate itself depends on them through an exponential function, with the remaining variation treated as random. The next article deals with the use of geospatial modelling in disease epidemic studies.
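For the Poisson model described above, the simplest case is again a single binary exposure. There, the exponentiated coefficient of a Poisson regression with person-time as the offset equals the crude incidence rate ratio, which can be computed directly. The sketch below does this with an approximate 95% confidence interval on the log scale; the function name and all counts are hypothetical.

```python
import math


def rate_ratio_with_ci(cases_exposed, person_time_exposed,
                       cases_unexposed, person_time_unexposed, z=1.96):
    """Incidence rate ratio with an approximate confidence interval.

    Rates are cases divided by person-time at risk. The interval uses the
    usual large-sample standard error of the log rate ratio.
    """
    rate_exposed = cases_exposed / person_time_exposed
    rate_unexposed = cases_unexposed / person_time_unexposed
    rate_ratio = rate_exposed / rate_unexposed
    se_log_rr = math.sqrt(1 / cases_exposed + 1 / cases_unexposed)
    lower = math.exp(math.log(rate_ratio) - z * se_log_rr)
    upper = math.exp(math.log(rate_ratio) + z * se_log_rr)
    return rate_ratio, lower, upper


# Hypothetical cohort: 40 cases in 1,000 person-years among the exposed,
# 20 cases in 2,000 person-years among the unexposed.
rate_ratio, lower, upper = rate_ratio_with_ci(40, 1000.0, 20, 2000.0)
print(f"rate ratio {rate_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

For these illustrative numbers the rate ratio is 4.0, with an interval of roughly 2.3 to 6.8. A full Poisson regression generalizes this to several covariates at once.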


  • Alemu, A. et al., 2011. Climatic variables and malaria transmission dynamics in Jimma town, South West Ethiopia. Parasites and Vectors, 4(1), p.30.
  • Bonita, R., Beaglehole, R. & Kjellstrom, T., 2006. Basic Epidemiology First., Geneva, Switzerland: World Health Organization.
  • Dicker, R. et al., 2012. Investigating an Outbreak: Steps of an Outbreak Investigation. In Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta: CDC, pp. 6-1–6-75.
  • Hufnagel, L., Brockmann, D. & Geisel, T., 2004. Forecast and control of epidemics in a globalized world. Proceedings of the National Academy of Sciences, 101(42), pp.15124–15129.
  • Jewell, N.P., 2009. Statistics for Epidemiology, Chapman & Hall/CRC.
  • Newman, S.C., 2003. Biostatistical Methods in Epidemiology, John Wiley & Sons.
  • Soyiri, I.N. & Reidpath, D.D., 2013. An overview of health forecasting. Environmental Health and Preventive Medicine, 18(1), pp.1–9.
  • Suarez, E. et al., 2017. Applications of Regression Models in Epidemiology, John Wiley & Sons.
  • Szklo, M. & Nieto, J., 2014. Basic Study Designs in Analytical Epidemiology. In M. Szklo & J. Nieto, eds. Epidemiology : Beyond the Basics. Jones & Bartlett Learning, pp. 3–44.
  • Viboud, C. et al., 2010. Preliminary Estimates of Mortality and Years of Life Lost Associated with the 2009 A/H1N1 Pandemic in the US and Comparison with Past Influenza Seasons. PLoS Currents, 2, p.RRN1153.
  • WHO, 2017. Disease Outbreaks. WHO. Available at: