Common pipeline for statistical analysis in epidemiological studies

Previous articles discussed the need for statistical analysis and modeling in epidemiological and public health studies. Analysis in epidemiological studies typically requires both descriptive and analytical methods. Descriptive analysis summarizes how disease outcome rates vary with demographic variables, and this variability forms the basis for developing hypotheses. Analytical epidemiology, on the other hand, determines the cause or mode of a disease outbreak. In general, descriptive and analytical studies are undertaken together, because the results of descriptive analysis offer clues for developing and testing hypotheses in analytical studies. This article explains the common steps in analyzing data from epidemiological studies, along with the statistical models used to study factors associated with epidemic diseases.

Steps in conducting statistical analysis in epidemiology

There are several common steps in conducting modeling studies of diseases, which are modified according to the study context or data. These steps include descriptive analysis of the data, hypothesis development, model selection and hypothesis testing (Dicker, Coronado, Koo, & Parrish, 2012). Each step serves a purpose and influences the next. The first step is to establish the existence of an outbreak and verify the diagnosis, so that the nature of the disease is confirmed. An outbreak or epidemic is defined as the occurrence of cases of a disease in excess of what would normally be expected in a defined community, geographical area or season (WHO 2017). Once the outbreak is confirmed, researchers collect data through active or passive biosurveillance. This involves collecting several types of information, as presented in Figure 1 below.

Figure 1: Types of information collected during biosurveillance. (Source: Dicker, Coronado, Koo, & Parrish, 2012)

After data collection, the next step is data analysis. This step helps answer the who, where, when, what, how and why of the disease. It also seeks to identify the agent or pathogen, the hosts affected and the associated factors. The four main steps in the analysis of epidemiological data are (Dicker et al. 2012):

Steps in Epidemiological Investigation (Source: Dicker et al., 2012)

Step 1: Descriptive epidemiology

In this step, the aim is to describe the characteristics and trends of the epidemic in terms of person, place and time. This is referred to as the ‘epidemiological triad for descriptive epidemiology’. The data are summarized on the basis of these variables, which reveals the following:

  1. Trends of disease prevalence and spread over time.
  2. Geographical distribution of the disease.
  3. Populations, communities or groups affected by the disease.

A major outcome of this step is identifying the population at risk of the disease. Knowledge of all these aspects helps in developing a hypothesis about the cause or mode of the disease outbreak. Common statistical analyses include:

  • measures of central tendency and spread,
  • trend analysis,
  • clustering analysis, and
  • association measures.
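
As a rough illustration of the first two measures listed above, the following Python sketch computes the central tendency, the spread and a simple trend for a series of weekly case counts. The counts, the variable names and the three-week window are hypothetical and chosen only for illustration; they do not come from any of the cited studies.

```python
from statistics import mean, median, stdev

# Hypothetical weekly case counts for one district (illustrative values only).
weekly_cases = [4, 6, 5, 9, 14, 22, 31, 27, 18, 11, 7, 5]


def moving_average(values, window):
    """Trailing moving average, one value per complete window."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Measures of central tendency and spread.
summary = {
    "mean": mean(weekly_cases),
    "median": median(weekly_cases),
    "sd": stdev(weekly_cases),  # sample standard deviation
    "peak_week": weekly_cases.index(max(weekly_cases)) + 1,  # weeks counted from 1
}

# A 3-week moving average smooths week-to-week noise to expose the trend.
trend = moving_average(weekly_cases, window=3)

print(summary)
print([round(value, 1) for value in trend])
```

A large gap between the mean and the median, as skewed outbreak counts tend to produce, is one reason to report both.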

Step 2: Hypothesis development

Based on the data summary of trends in time, place and person of a disease, a hypothesis is developed. It addresses any or all of the following questions:

  1. What is the causative agent of the disease (pathogen)?
  2. What is the mode of transmission of the disease (vehicle or vector)?
  3. Which exposures caused the disease outbreak?

The hypothesis is developed from background information about the disease with respect to host, agent and environment. This is referred to as the ‘epidemiological triad for analytical epidemiology’.

Step 3: Analytical test or model selection

Depending on the data characteristics, the variables included and the objective or hypothesis, a suitable mathematical or statistical test or model is selected. It could be a simple regression model to test relationships between associated factors, or a forecasting model that predicts future events from past trends of the disease.

Step 4: Hypothesis testing

Hypothesis testing draws on environmental, laboratory or epidemiological data, or a combination of these. A hypothesis is mainly tested by:

  1. Comparing the hypothesis with established facts, or evaluating it through analytical epidemiology.
  2. Assessing the role of chance.

A statistical test enables comparison with past data and identification of patterns in the data. This includes comparing observed patterns (among affected cases) with expected patterns (among unaffected cases).
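
One common way to compare observed and expected patterns and to assess the role of chance is a chi-square test on a two-by-two table of exposure against illness. The sketch below is a minimal example under that assumption: the function name `two_by_two_summary` and the counts are hypothetical, and the test shown is the uncorrected Pearson chi-square with one degree of freedom.

```python
import math


def two_by_two_summary(a, b, c, d):
    """Summarize a 2x2 table of exposure against illness.

    a: exposed and ill      b: exposed and well
    c: unexposed and ill    d: unexposed and well
    """
    n = a + b + c + d
    # Pearson chi-square statistic (no continuity correction), 1 degree of freedom.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    risk_ratio = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "risk_ratio": risk_ratio,
        "odds_ratio": odds_ratio,
    }


# Hypothetical outbreak data: 30 of 50 exposed people fell ill, 10 of 50 unexposed.
result = two_by_two_summary(a=30, b=20, c=10, d=40)
print(result)
```

A small p-value indicates that a difference of this size would rarely arise by chance alone if exposure and illness were unrelated; the risk ratio and odds ratio describe the strength of the association.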

Another type of analysis is prediction or forecasting of future outcomes in terms of host mortality or morbidity (Hufnagel et al. 2004). It is not part of the common pipeline but is widely used in epidemiological studies. Forecasting can be based on past occurrences of the disease or its outcomes within a population. It is instrumental in developing warning systems and public health strategies, since information on future epidemics helps in implementing preventive measures and treatment options (Soyiri & Reidpath 2013). Time series forecasting models are commonly used for this purpose. They can include only past incidents (univariate models) or also associated factors such as climate, vector parameters and host behaviors (multivariate models) (Newman 2003). Later articles discuss the conceptual basis of these forecasting models and present a systematic review of prominent models used in forecasting.
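
To make the idea of a univariate forecast concrete, the sketch below uses simple exponential smoothing, one of the most basic time series methods. It is an illustrative choice, not a model taken from the cited sources; the function name, the smoothing weight `alpha` and the monthly counts are all hypothetical.

```python
def exponential_smoothing_forecast(series, alpha):
    """One-step-ahead forecast by simple exponential smoothing.

    alpha in [0, 1] controls how strongly recent observations are weighted.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    level = series[0]  # initialise the level with the first observation
    for observed in series[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level  # the final level is the forecast for the next period


# Hypothetical monthly case counts (illustrative values only).
monthly_cases = [12, 15, 11, 18, 24, 21, 30]
forecast = exponential_smoothing_forecast(monthly_cases, alpha=0.5)
print(round(forecast, 1))
```

This method uses past incidents only. A multivariate model would add covariates such as rainfall or vector density as further inputs.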

Consequently, such an investigation provides answers related to the disease and its outbreak. Moreover, the analysis explains the causal agents and associated factors increasing risks in certain populations. Lastly, it is possible to predict and hence prevent or control future events of the disease.

Regression in epidemiology

Regression models are very useful in epidemiological studies. They help to explore relationships between several variables of a disease in a community (Suarez et al. 2017). For instance, they help assess possible cause-effect relationships between an outcome and several factors that may contribute to an epidemic. These models are based on the assumption that variables associated with a disease influence each other. An epidemic of malaria in Ethiopia is a good example: research using regression models determined that meteorological factors such as rainfall and temperature affected the seasonal transmission of the disease (Alemu et al. 2011). Regression models also help estimate the mortality and morbidity of an epidemic. For instance, mortality in the 2009 influenza pandemic was assessed using the Serfling regression model, a form of seasonal regression (Viboud et al. 2010).

In regression models, two types of variables are used. The variable of interest is the dependent variable, and the associated variables are the independent variables. Examples of dependent variables are disease outcomes such as mortality, morbidity or disease risk. Examples of independent variables are the demographic profile of hosts, temperature, humidity and vector prevalence. There are four main types of regression models applicable in epidemiological studies, depending on the characteristics of the dependent variable (DP) (Bonita et al. 2006; Szklo & Nieto 2014).

  1. Linear Regression models (DP is continuous data).
  2. Logistic Regression models (DP is dichotomous data like Yes or No).
  3. Cox Proportional Hazards models (DP is time from baseline or non-event to event).
  4. Poisson Regression models (DP is an incidence rate based on person-time).

Applicability of different regression models in epidemiological investigation

Linear models are useful when the dependent variable is continuous, such as body weight in the case of an obesity epidemic. They can also reveal the relationship between different factors like urbanization and behaviors. This model assumes that the outcome changes with the risk factors in a linear fashion (Jewell 2009).
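
The linear assumption can be sketched with an ordinary least squares fit of one continuous outcome on one factor. The example below is a minimal illustration: the function name and the data (weekly hours of physical activity against body weight) are hypothetical and serve only to show how the slope and intercept are obtained.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sum of squares around the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: weekly hours of physical activity and body weight in kg.
activity_hours = [0, 1, 2, 3, 4, 5, 6]
body_weight_kg = [92.0, 90.5, 88.0, 87.5, 85.0, 84.5, 82.0]

intercept, slope = fit_simple_linear_regression(activity_hours, body_weight_kg)
print(f"weight = {intercept:.2f} + ({slope:.2f}) * hours")
```

The slope is read as the average change in the outcome for a one-unit change in the factor.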

Logistic regression models, on the other hand, are useful when the outcomes of a disease are binary (yes or no). Examples include death (dead or alive), disease (affected or unaffected) and recovery (recovered or not recovered). Logistic regression resembles linear regression, but it is the log-odds of the event that is modeled as a linear function of the independent variables. The fitted model gives the probability of the event occurring, and its exponentiated coefficients are odds ratios.
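
The link between log-odds, probabilities and odds ratios can be shown with the simplest possible case: one binary exposure. In that special case the maximum likelihood estimates have a closed form that can be read off a two-by-two table, so no iterative fitting is needed. The function names and the counts below are hypothetical; models with several or continuous covariates require an iterative fit from a statistics library.

```python
import math


def logistic(z):
    """Inverse of the logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_single_binary_exposure(a, b, c, d):
    """Closed-form logistic regression for one binary exposure.

    a: exposed and ill      b: exposed and well
    c: unexposed and ill    d: unexposed and well
    Model: log-odds(ill) = beta0 + beta1 * exposed
    """
    beta0 = math.log(c / d)              # log-odds of illness when unexposed
    beta1 = math.log((a / b) / (c / d))  # log odds ratio, exposed vs unexposed
    return beta0, beta1


# Hypothetical counts (illustrative values only).
beta0, beta1 = fit_single_binary_exposure(a=30, b=20, c=10, d=40)

odds_ratio = math.exp(beta1)
p_unexposed = logistic(beta0)
p_exposed = logistic(beta0 + beta1)
print(round(odds_ratio, 2), round(p_unexposed, 2), round(p_exposed, 2))
```

The predicted probabilities reproduce the observed proportions in each exposure group, which is what a model with one binary covariate should do.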

Cox Proportional Hazards models are relevant for survival or time-to-event data. In survival data, the outcome is the waiting time until an event occurs. In epidemiology, these models are useful when exposure to a risk factor multiplies the hazard of the disease outcome by a constant factor (the hazard ratio) (Szklo & Nieto 2014).
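
A hazard ratio of this kind can be estimated by maximizing the Cox partial likelihood. The sketch below does this for a single covariate with plain Newton-Raphson iteration, assuming no tied event times. The function names and the follow-up records are hypothetical, and a real analysis would normally rely on an established survival analysis library rather than hand-written code.

```python
import math


def score_and_information(beta, records):
    """Score and information of the Cox partial log-likelihood (one covariate).

    records: list of (time, event, x); event is 1 if observed, 0 if censored.
    Assumes no tied event times.
    """
    score = 0.0
    information = 0.0
    for time_i, event_i, x_i in records:
        if not event_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at this event time.
        risk_set = [
            (x_j, math.exp(beta * x_j))
            for time_j, _, x_j in records
            if time_j >= time_i
        ]
        total_weight = sum(w for _, w in risk_set)
        mean_x = sum(x * w for x, w in risk_set) / total_weight
        mean_x2 = sum(x * x * w for x, w in risk_set) / total_weight
        score += x_i - mean_x                  # observed minus expected covariate
        information += mean_x2 - mean_x ** 2   # weighted variance in the risk set
    return score, information


def fit_cox_single_covariate(records, tolerance=1e-10, max_iterations=50):
    """Newton-Raphson maximisation of the partial likelihood; returns beta."""
    beta = 0.0
    for _ in range(max_iterations):
        score, information = score_and_information(beta, records)
        if information == 0:
            break  # no variation in the covariate within any risk set
        step = score / information
        beta += step
        if abs(step) < tolerance:
            break
    return beta


# Hypothetical follow-up data: (time in weeks, event observed, exposed).
records = [
    (1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 1, 1), (5, 1, 1),
    (6, 1, 0), (7, 1, 0), (8, 1, 1), (9, 1, 0), (10, 0, 0),
]

beta = fit_cox_single_covariate(records)
hazard_ratio = math.exp(beta)
print(round(beta, 3), round(hazard_ratio, 2))
```

A hazard ratio above 1 means that, at any given time, exposed subjects experience the event at a higher rate than unexposed subjects, under the proportional hazards assumption.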

Poisson or log-linear regression models are used when the outcome of a disease is a rate (or rate ratio). This refers to the rate of outcomes within a population, as in the case of rare diseases (Szklo & Nieto 2014). The model assumes that the expected rate is an exponential function of a combination of covariates, with the remaining variation attributed to chance. The next article deals with the use of geospatial modeling in disease epidemic studies.


  • Alemu, A. et al., 2011. Climatic variables and malaria transmission dynamics in Jimma town, South West Ethiopia. Parasites and Vectors, 4(1), p.30.
  • Bonita, R., Beaglehole, R. & Kjellstrom, T., 2006. Basic Epidemiology, 2nd ed., Geneva, Switzerland: World Health Organization.
  • Dicker, R. et al., 2012. Investigating an Outbreak: Steps of an Outbreak Investigation. In Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta: CDC, pp. 6-1 to 6-75.
  • Hufnagel, L., Brockmann, D. & Geisel, T., 2004. Forecast and control of epidemics in a globalized world. Proceedings of the National Academy of Sciences, 101(42), pp.15124–15129.
  • Jewell, N.P., 2009. Statistics for Epidemiology, Chapman & Hall/CRC.
  • Newman, S.C., 2003. Biostatistical Methods in Epidemiology, John Wiley & Sons.
  • Soyiri, I.N. & Reidpath, D.D., 2013. An overview of health forecasting. Environmental Health and Preventive Medicine, 18(1), pp.1–9.
  • Suarez, E. et al., 2017. Applications of Regression Models in Epidemiology, John Wiley & Sons.
  • Szklo, M. & Nieto, J., 2014. Basic Study Designs in Analytical Epidemiology. In M. Szklo & J. Nieto, eds. Epidemiology : Beyond the Basics. Jones & Bartlett Learning, pp. 3–44.
  • Viboud, C. et al., 2010. Preliminary Estimates of Mortality and Years of Life Lost Associated with the 2009 A/H1N1 Pandemic in the US and Comparison with Past Influenza Seasons. PLoS Currents, 2, p.RRN1153.
  • WHO, 2017. Disease Outbreaks. WHO. Available at:
Chandrika Kapagunta

Research Analyst at Project Guru
Chandrika is a nature enthusiast with special love for the marine world. Her Master’s degree in Marine Biotechnology and Scuba Diving experience has made her a strong advocate of environment and marine conservation, especially through bioremediation. She believes in finding solutions of everyday human problems in nature, be it medicines, technology or philosophy. Having worked as a volunteer at The Bombay Natural History Society and as a Senior Research Fellow at Central Institute of Fisheries Education, she has had exposure to the current state of the academic research, specifically in the field of environmental biotechnology.