Systematic review of forecasting models in disease epidemiology

By Avishek Majumder & Chandrika Kapagunta on November 4, 2017

In the previous article, the role and advantages of forecasting models in disease epidemiology were discussed. Forecasting models are important tools that assist public health decision making. They help predict future disease trends, incidence and possible risks in a population or community. As discussed previously, many forecasting models are used to analyse time series data in epidemiological studies. This article presents a systematic review of several such methods and models that researchers and institutions have used to predict infectious disease events and patterns. These models have been applied to diseases caused by different types of pathogens, whether bacterial, viral, fungal or parasitic, including diseases whose transmission is accelerated by vectors.

Selection of time series models for epidemiological studies

Forecasting models are selected based on the properties of the data, the variables involved and the aims and objectives of the study. Because each disease epidemic progresses differently, comparing different forecasting models helps in selecting the most appropriate one (Zhang et al. 2013). Four main groups of forecasting models are popularly used for infectious diseases. Each group is reviewed in the tables below.
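As a minimal sketch of how such a comparison can work, the Python example below scores three simple baseline forecasts on held-out data using root mean squared error. The case counts are synthetic and the three baselines (naive, seasonal naive, mean) are assumptions chosen for illustration; they are not the models compared in the studies reviewed here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def naive_forecast(train, horizon):
    """Repeat the last observed value for every future step."""
    return [train[-1]] * horizon

def seasonal_naive_forecast(train, horizon, period):
    """Repeat the values observed during the last full season."""
    return [train[len(train) - period + (h % period)] for h in range(horizon)]

def mean_forecast(train, horizon):
    """Repeat the historical mean for every future step."""
    return [sum(train) / len(train)] * horizon

# Hypothetical monthly case counts with a yearly cycle (synthetic, not real data).
cases = [20 + 10 * math.sin(2 * math.pi * m / 12) for m in range(48)]
train, test = cases[:36], cases[36:]  # hold out the final year

candidates = {
    "naive": naive_forecast(train, len(test)),
    "seasonal_naive": seasonal_naive_forecast(train, len(test), period=12),
    "mean": mean_forecast(train, len(test)),
}
scores = {name: rmse(test, forecast) for name, forecast in candidates.items()}
best = min(scores, key=scores.get)  # candidate with the lowest hold-out error
print(best, scores)
```

Because the synthetic series is purely seasonal, the seasonal baseline wins here; with real surveillance data the ranking depends on the disease and the data, which is the point of comparing models.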

Types of forecasting models used in epidemiology

Models based on Box-Jenkins methods

No.	Forecasting model	Disease and geographical region	Variables studied	Results
1	Moving Averages of Mixed Generalised Additive Model (MGAM) (Ma et al. 2013)	Bacillary Dysentery (BD) (Shanghai, China)	Daily meteorological data and BD case count.	Temperature was significantly and linearly associated with the logarithm of BD count between 12 and 22°C. The predictive model showed good fit, with R² of 0.875 on internal data. Prediction on external data gave a correlation coefficient of 0.859.
2	ARIMA (Dom et al. 2013)	Dengue (Subang Jaya, Malaysia)	Dengue incidence and climate variables.	ARIMA(2,0,0)(0,0,1)₅₂ was the best model with weekly variations. It could predict efficiently 4 weeks ahead. Performance of the model increased when climate variables were included as external regressors.
3	Univariate SARIMA (Moosazadeh et al. 2014)	Tuberculosis (Iran)	Tuberculosis cases (monthly) per 100,000 population.	An average of 756.8 (SD = 11.9) tuberculosis cases were detected per month. Among four models, SARIMA(0,1,1)(0,1,1)₁₂ showed the lowest AIC (12.78). This model predicted 16.75 cases per 100,000 people in 2014.
4	ARIMA (Wang et al. 2017)	Influenza (Ningbo, China)	Cases of Influenza-like-Illness, climate variables.	ARIMA(1,1,1)(1,1,0)₁₂ was the best model fitting the existing data. Influenza rates in Ningbo were found to peak twice a year, correlated with rainy and cold periods.

A systematic review of studies using Box-Jenkins methods in epidemiology

As seen from these studies, climate variables were important in the seasonal trends of disease distribution for most diseases. Both ARIMA and SARIMA are used to predict disease incidence and outcomes. However, SARIMA is more advantageous when a disease has inherent seasonal trends, because it models the seasonal pattern explicitly and so yields more accurate predictions.
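In the notation used in the table, ARIMA(p,d,q)(P,D,Q)ₛ, the terms d and D give the number of ordinary and seasonal differences applied before the autoregressive and moving-average parts are fitted, and s is the season length. The Python sketch below illustrates only that differencing step for d = 1, D = 1 and s = 12, using a synthetic monthly series (an assumption, not data from any cited study). Estimating the remaining terms would normally be left to a statistics library.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly counts: a linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 2, 1, 0, 1, 3, 6, 9, 8, 7, 6]
series = [2 * t + pattern[t % 12] for t in range(48)]

# Seasonal difference (D = 1, s = 12) removes the repeating yearly pattern.
seasonal = difference(series, lag=12)

# Ordinary difference (d = 1) removes what is left of the trend.
stationary = difference(seasonal, lag=1)

print(seasonal[:3], stationary[:3])
```

Each difference shortens the series (by 12 and then by 1 observation here), which is one reason seasonal models need several full seasons of data.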

Models based on probabilistic models

No.	Forecasting model	Disease and geographical region	Variables studied	Results
1	Multi-step Polynomial Transformation (Chatterjee & Sarkar 2009)	Malaria (Chennai, India)	Slide Positive Rates (SPR) value, P.vivax deaths, temperature, humidity, rainfall.	High prediction power of model in predicting slide positivity rates and P.vivax deaths. Climate variables, disease incidence at zonal levels both influence prediction. Long term forecasting is efficient.
2	Multivariate Time Series model based on Monte Carlo simulations (Held et al. 2017)	Norovirus Gastroenteritis (Berlin, Germany)	Weekly counts of infection, Age, District.	The best model fitted included age-structured data with social contact data. Model 4 shows best final size, long term prediction curve.
3	Markov model along with Monte Carlo simulation (Rein et al. 2011)	Hepatitis C (USA)	Demographic data, Prevalence estimates of Full range of Hepatitis C disease state.	Estimated death due to HCV higher than reported deaths by 12.7%. According to forecast, HCV cases will peak between 2030-2035 and decline after 2060. End-stage liver disease cases at 38,600 in 2030.

A systematic review of studies using probabilistic models in epidemiology

Probabilistic models are useful for disease prediction when data are limited or relationships in the data are hidden. Forecast values should have uncertainty attached to them (Held et al. 2017). Probabilistic models address the inherent difficulty of forecasting epidemics because probabilities are attached to the final predicted values.
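One way to attach uncertainty to a forecast is to run a Markov model many times by Monte Carlo simulation and report a range of outcomes instead of a single number. The Python sketch below is illustrative only: the three disease states, the yearly transition probabilities and the cohort size are hypothetical and are not taken from Rein et al. (2011) or any other cited study.

```python
import random

# Hypothetical yearly transition probabilities (illustrative values only).
TRANSITIONS = {
    "chronic": [("chronic", 0.95), ("cirrhosis", 0.04), ("dead", 0.01)],
    "cirrhosis": [("cirrhosis", 0.90), ("dead", 0.10)],
    "dead": [("dead", 1.0)],
}

def step(state, rng):
    """Draw the next state for one person from the transition table."""
    draw = rng.random()
    cumulative = 0.0
    for next_state, prob in TRANSITIONS[state]:
        cumulative += prob
        if draw < cumulative:
            return next_state
    return TRANSITIONS[state][-1][0]  # guard against floating-point rounding

def simulate_deaths(cohort_size, years, rng):
    """Follow a cohort that starts in the chronic state; count deaths at the end."""
    states = ["chronic"] * cohort_size
    for _ in range(years):
        states = [step(s, rng) for s in states]
    return states.count("dead")

def forecast_interval(runs=200, cohort_size=200, years=20, seed=1):
    """Return the (2.5th percentile, median, 97.5th percentile) of simulated deaths."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results = sorted(simulate_deaths(cohort_size, years, rng) for _ in range(runs))
    lower = results[runs * 25 // 1000]
    median = results[runs // 2]
    upper = results[runs * 975 // 1000 - 1]
    return lower, median, upper

lower, median, upper = forecast_interval()
print(f"Deaths after 20 years: median {median}, 95% interval {lower} to {upper}")
```

The width of the interval reflects only random variation between simulated cohorts. Uncertainty in the transition probabilities themselves would need to be sampled as well in a fuller analysis.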

Models based on spatiotemporal analysis methods

No.	Forecasting Model	Disease and geographical region	Variables studied	Results
1	Generalised Linear Mixed Model (Lowe et al. 2013)	Dengue (South East Brazil)	Notified Dengue fever counts per month, national cartographic data and levels of urbanization, climate and Oceanic Nino index.	Successful epidemic alerts can be issued for 81% of 54 regions. Predictions possible several months in advance.
2	Spatio-temporal hierarchical Bayesian model (Lowe et al. 2014)	Dengue (Brazil)	Confirmed dengue cases, demographic density, urban population, monthly precipitation, temperature and altitude.	Different regions of Brazil had varying levels of risk. A low-medium level of risk was forecast for host cities of the World Cup. The model allowed for prediction 3 months in advance.
3	stsSEIR model (Lai et al. 2015)	H1N1 (Hong Kong)	Daily influenza cases, demographic data of patients, population and land usage data.	Immediate forecast values (1-2 days) were more sensitive than extended forecasts (6-7 days). The R² value of 1-2 day forecasts was higher than that of 6-7 day forecasts. The model predicted better for some areas of Hong Kong than for others.

A systematic review of studies using spatiotemporal methods in epidemiology

Spatiotemporal prediction models forecast both when and where disease is likely to occur. They show probable high-risk areas in the event of a future epidemic. These models are based on previous trends in incidence and climate factors, which allows better prediction of disease incidence or outcomes.
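To illustrate how a model can rank areas by risk, the Python sketch below runs a discrete-time SEIR (susceptible, exposed, infectious, recovered) model over three connected districts. It is a deliberately simplified assumption-laden example and is not the stsSEIR model of Lai et al. (2015): the populations, the contact matrix and the rates beta, sigma and gamma are all hypothetical, and climate factors are left out.

```python
def simulate_patch_seir(populations, initial_infected, coupling,
                        beta=0.5, sigma=0.25, gamma=0.2, days=120):
    """Daily-step SEIR model across districts.

    coupling[i][j] is the share of district i's contacts made with
    district j (each row sums to 1).
    """
    n = len(populations)
    susceptible = [float(populations[i] - initial_infected[i]) for i in range(n)]
    exposed = [0.0] * n
    infectious = [float(x) for x in initial_infected]
    recovered = [0.0] * n
    peak = [0.0] * n  # highest infectious share seen in each district

    for _ in range(days):
        # Prevalence at the start of the day drives that day's infections.
        prevalence = [infectious[j] / populations[j] for j in range(n)]
        force = [beta * sum(coupling[i][j] * prevalence[j] for j in range(n))
                 for i in range(n)]
        for i in range(n):
            new_exposed = force[i] * susceptible[i]
            new_infectious = sigma * exposed[i]
            new_recovered = gamma * infectious[i]
            susceptible[i] -= new_exposed
            exposed[i] += new_exposed - new_infectious
            infectious[i] += new_infectious - new_recovered
            recovered[i] += new_recovered
            peak[i] = max(peak[i], infectious[i] / populations[i])
    return susceptible, exposed, infectious, recovered, peak

# Hypothetical districts: the outbreak is seeded in district 0 only.
populations = [100000, 80000, 50000]
initial_infected = [10, 0, 0]
coupling = [
    [0.90, 0.08, 0.02],
    [0.10, 0.85, 0.05],
    [0.04, 0.06, 0.90],
]
s, e, i_, r, peak = simulate_patch_seir(populations, initial_infected, coupling)

# Districts ordered from highest to lowest peak infectious share.
risk_ranking = sorted(range(len(populations)), key=lambda d: peak[d], reverse=True)
print(risk_ranking, [round(p, 3) for p in peak])
```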

Models based on artificial neural networks

No.	Forecasting Model	Disease and Geographical Region	Variables studied	Results
1	Support Vector Machine-Firefly Algorithm model (SVM-FFA) (Ch et al. 2014)	Malaria (Bikaner and Jodhpur, India)	Malaria incidences, climate data like rainfall, temperature and humidity.	SVM-FFA model is more accurate than ARMA, ANN and SVM alone. Fit value, Normalized Mean Square Error values are lowest for SVM at 0.13. The SVM-FFA model with incidence rates and climate variables performed best.
2	Hybrid model of Grey Model (GM) and Back Propagation Artificial Neural Networks (BP-ANN) (Gan et al. 2015)	Hepatitis B (China)	Hepatitis B Incident rates.	Prediction by proposed model more accurate than GM models. Relative error smallest for the proposed model.
3	Back Propagation Artificial Neural Networks (BP-ANN) (Pezeshki et al. 2016)	Cholera (Chabahar District, Iran)	Monthly and seasonal average values of cholera incidents and climate variables (temperature, humidity, rainfall), distance from border and health centres.	The best model was trained with climate and spatial data. The optimized model accurately predicted 80% in 100 villages with 44.4% specificity.
4	Machine Learning (ML) Pipeline based on Artificial Neural Network and Support Vector Machine (Colubri et al. 2016)	Ebola	Suspected or Positive Ebola cases, Clinical data, Laboratory data, Viral load of patients	ML-based prediction model useful in prognostic prediction of Ebola patients undergoing treatment. Several Clinical and Laboratory symptoms can predict patient prognosis.

A systematic review of studies based on artificial neural networks in epidemiology

Artificial neural networks (ANN) are suited to problems with limited data and high ambiguity, where uncertainty is the central difficulty. This is because ANN is useful for non-linear statistical modelling. Consequently, ANN-based prediction models can be trained on existing data to derive the best model.
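Training on existing data can be sketched with a very small back-propagation network that predicts the next value of a series from the two previous values. The Python example below is a minimal illustration under stated assumptions: the scaled incidence series is synthetic, and the network size, learning rate and number of epochs are arbitrary choices, not settings from Gan et al. (2015) or Pezeshki et al. (2016).

```python
import math
import random

def train_bp_ann(inputs, targets, hidden=4, epochs=2000, rate=0.05, seed=0):
    """Train a one-hidden-layer network (tanh hidden units, linear output)
    by stochastic gradient descent with back-propagation."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # The last entry of each weight list is the bias term.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + row[-1]) for row in w1]
        y = sum(w * hi for w, hi in zip(w2, h)) + w2[-1]
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)

    initial_error = mse()
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            h, y = forward(x)
            err = y - t  # derivative of 0.5 * (y - t)^2 with respect to y
            for j in range(hidden):
                # Error propagated back through the tanh unit (uses w2[j] before its update).
                grad_h = err * w2[j] * (1 - h[j] ** 2)
                w2[j] -= rate * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= rate * grad_h * x[i]
                w1[j][-1] -= rate * grad_h
            w2[-1] -= rate * err

    def predict(x):
        return forward(x)[1]

    return predict, initial_error, mse()

# Hypothetical monthly incidence scaled to the range 0.1 to 0.9 (synthetic data).
series = [0.5 + 0.4 * math.sin(2 * math.pi * m / 12) for m in range(36)]
lags = 2
inputs = [series[t - lags:t] for t in range(lags, len(series))]
targets = [series[t] for t in range(lags, len(series))]

predict, error_before, error_after = train_bp_ann(inputs, targets)
next_value = predict(series[-lags:])  # one-step-ahead forecast
print(error_before, error_after, next_value)
```

The sketch reports training error only. In practice, accuracy should be judged on data held out from training, since a network can fit the training series closely and still forecast poorly.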

Creating ideal forecasting models for epidemiology

This article presented a systematic review of different types of forecasting models for predicting epidemics. These models have their own inherent characteristics, and the choice among them depends on the properties of the data, the disease and the aim of the study. However, while developing or applying any model, its limitations should also be discussed, since this supports sounder inference. Ideal forecasting models do not exist, but models that can accurately predict future outcomes and distribution patterns are needed. They will help frame better control and prevention strategies.

References

Ch, S. et al., 2014. A Support Vector Machine-Firefly Algorithm based forecasting model to determine malaria transmission. Neurocomputing, 129, pp.279–288.
Chatterjee, C. & Sarkar, R.R., 2009. Multi-Step Polynomial Regression Method to Model and Forecast Malaria Incidence. PLoS One, 4(3), p.e4276.
Colubri, A. et al., 2016. Transforming Clinical Data into Actionable Prognosis Models: Machine-Learning Framework and Field-Deployable App to Predict Outcome of Ebola Patients. PLoS Neglected Tropical Diseases, 10(3), p.e0004549.
Dom, N.C. et al., 2013. Generating temporal model using climate variables for the prediction of dengue cases in Subang Jaya, Malaysia. Asian Pacific Journal of Tropical Disease, 3(5), pp.352–361.
Gan, R. et al., 2015. Application of a hybrid method combining grey model and back propagation artificial neural networks to forecast hepatitis B in China. Computational and Mathematical Methods in Medicine, 2015, p.ID 328273.
Held, L., Meyer, S. & Bracher, J., 2017. Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine.
Lai, P.C. et al., 2015. An early warning system for detecting H1N1 disease outbreak–a spatio-temporal approach. International Journal of Geographical Information Science, 29(7), pp.1251–1268.
Lowe, R. et al., 2013. The development of an early warning system for climate-sensitive disease risk with a focus on dengue epidemics in Brazil. Statistics in Medicine, 32(5), pp.864–883.
Lowe, R. et al., 2014. Dengue outlook for the World Cup in Brazil: an early warning model framework driven by real-time seasonal climate forecasts. The Lancet Infectious Diseases, 14(7), pp.619–626.
Ma, W. et al., 2013. Applied Mixed Generalized Additive Model to Assess the Effect of Temperature on the Incidence of Bacillary Dysentery and Its Forecast. PLoS One, 8(4), p.e62122.
Moosazadeh, M. et al., 2014. Forecasting Tuberculosis Incidence in Iran Using Box-Jenkins Models. Iranian Red Crescent Medical Journal, 16(5), p.e11779.
Pezeshki, Z. et al., 2016. Model of Cholera Forecasting Using Artificial Neural Network in Chabahar City, Iran. International Journal of Enteric Pathogens, 4(1), pp.23–30.
Rein, D.B. et al., 2011. Forecasting the morbidity and mortality associated with prevalent cases of pre-cirrhotic chronic hepatitis C in the United States. Digestive and Liver Disease, 43(1), pp.66–72.
Wang, C. et al., 2017. Epidemiological Features and Forecast Model Analysis for the Morbidity of Influenza in Ningbo, China, 2006–2014. International Journal of Environmental Research and Public Health, 14(6), p.559.
Zhang, X. et al., 2013. Comparative Study of Four Time Series Methods in Forecasting Typhoid Fever Incidence in China. PLoS One, 8(5), p.e63116.