DengAI: Predicting Disease Spread (A Tale of Two Cities)

This is a short descriptive post based on our project for the CS4642 module, Data Mining and Information Retrieval, at the University of Moratuwa. The wording here is my own, so it may not exactly match the final report I submitted for the course assignment.

Background to Data Mining Task

Dengue fever has become a serious threat to humankind in recent times. It is a mosquito-borne disease that occurs in tropical and subtropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. Severe cases are more dangerous: dengue fever can cause severe bleeding, low blood pressure, and even death. Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

Methodology

We split our task into four main stages (data cleaning, feature analysis, data normalization, and regression) and researched each one independently.

Data Cleaning

This was the first phase of data mining. Data is often noisy and inconsistent for several reasons, such as reading errors, missing data, different formats, and different units. We needed to ensure the data was free of these issues before using it for training or testing. In particular, the San Juan and Iquitos data contained many null and missing values that had to be filled. We tried several strategies over time to improve the score:

  • Based on Previous Value

This approach amounts to linear interpolation: draw a line between the two nearest available points and read the missing values off that line. It works well when each value is closely related to the previous observation.

  • Based on Mean Value

This was not effective, since it fills every missing entry with the mean of the entire column regardless of when the gap occurs.

  • Based on Median Value

This was worse than the mean for most features, as it fills every missing entry with the median of the entire column.

  • Based on Regression

This was the best of the four strategies, since the filled value follows the trend across weeks instead of being a single constant.
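
The four filling strategies above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses plain Python rather than the R tooling listed later, the function names and the toy `weekly_temp` series are hypothetical, and the regression variant assumes a straight-line fit against the week index.

```python
from statistics import mean, median


def fill_constant(values, stat=mean):
    """Replace every gap (None) with one statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = stat(observed)
    return [fill if v is None else v for v in values]


def fill_interpolate(values):
    """Fill each gap on the straight line between its nearest observed neighbours.

    Gaps at the very start or end copy the nearest observed value.
    Assumes at least one observed value exists.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def fill_regression(values):
    """Fit value = a + b * week_index by least squares, then predict the gaps.

    Assumes at least two observed values (an assumption of this sketch).
    """
    points = [(i, v) for i, v in enumerate(values) if v is not None]
    mean_x = mean(i for i, _ in points)
    mean_y = mean(v for _, v in points)
    slope = (sum((i - mean_x) * (v - mean_y) for i, v in points)
             / sum((i - mean_x) ** 2 for i, _ in points))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * i if v is None else v
            for i, v in enumerate(values)]


# Hypothetical weekly temperature readings with gaps.
weekly_temp = [25.0, None, 27.5, None, None, 29.0]
by_mean = fill_constant(weekly_temp, mean)
by_median = fill_constant(weekly_temp, median)
by_interpolation = fill_interpolate(weekly_temp)
by_regression = fill_regression(weekly_temp)
```

The mean and median variants write the same number into every gap, whereas the interpolation and regression variants produce values that change from week to week.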

We also removed rows that had fewer than 6 non-null values across the 21 features (in most cases these contained only plantation details, while the other correlated features were missing).

Feature Analysis

Not all features are actually correlated with the output, so we calculated the Pearson correlation coefficient between each feature and the output to measure how strongly they are related. We did this separately for San Juan and Iquitos because their outputs differed significantly.
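
As a rough illustration of this step, the sketch below computes the Pearson coefficient by hand and ranks features by its absolute value. It is written in plain Python rather than R, and the feature names and numbers are made-up toy data, not values from the actual dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def rank_features(features, target):
    """Return (name, r) pairs sorted by absolute correlation with the target."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data for one city; the real columns and values differ.
cases = [4, 6, 9, 12, 15, 11]
features = {
    "temperature": [24.0, 25.1, 26.3, 27.8, 28.9, 27.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
ranking = rank_features(features, cases)
```

Running the ranking once per city, as described above, would simply mean calling `rank_features` with each city's own columns.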

We carried out the same process for the following scenarios:

  1. Use the same correlation features common to both cities
  2. Use different correlation features for each city
  3. Use different and time-shifted features for each city
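
The time-shifted scenario can be pictured with a small sketch like the one below, which pairs each week's output with a feature value observed some weeks earlier. The helper names are hypothetical and the code is plain Python rather than the R used in the project; the lag passed in is a parameter of the sketch.

```python
def lag_feature(values, weeks):
    """Shift a weekly series so that row t holds the value from week t - weeks."""
    if weeks == 0:
        return list(values)
    return [None] * weeks + list(values[:-weeks])


def aligned_pairs(feature, target, weeks):
    """Pair each week's target with the feature seen `weeks` earlier.

    Rows at the start that have no earlier feature value are dropped.
    """
    lagged = lag_feature(feature, weeks)
    return [(f, y) for f, y in zip(lagged, target) if f is not None]


# Hypothetical toy series: rainfall per week and dengue cases per week.
rainfall = [10.0, 40.0, 5.0, 0.0, 20.0, 15.0]
cases = [2, 3, 3, 6, 9, 4]
pairs = aligned_pairs(rainfall, cases, weeks=3)
```

The correlation can then be recomputed on the aligned pairs for each candidate lag to see which shift relates most strongly to the output.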

Data Normalization

Data normalization is important when features are recorded in different units or when value ranges vary widely across features, as this may cause false correlations and wrong predictions. We tried two types of normalization:

  1. Min-Max Normalization
  2. Z-Score Normalization
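
For reference, the two normalizations can be sketched as follows in plain Python. Using the population standard deviation for the Z-score is an assumption of this sketch, and a constant column would need special handling because both formulas would then divide by zero.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """Centre values on their mean and scale by their standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (sketch assumption)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column in arbitrary units.
humidity = [10.0, 20.0, 30.0]
scaled = min_max(humidity)
standardized = z_score(humidity)
```

Min-max output always lies in the range 0 to 1, while Z-scores are unbounded but have mean 0 and unit standard deviation.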

Regression Techniques

As the final step, we applied the following regression techniques to predict the output and compared the mean absolute errors obtained for San Juan and Iquitos:

  1. Poisson
  2. Negative Binomial
  3. Linear Regression
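
To make the first technique and the error metric concrete, here is a minimal sketch of Poisson regression on a single feature, fitted by Newton's method, together with the mean absolute error. It is plain Python with hypothetical names and toy data, not the R models used in the project, and it does not cover negative binomial regression, which adds a dispersion parameter on top of the Poisson model.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Fit log(expected count) = b0 + b1 * x by Newton's method."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start from the mean count
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system directly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predict_poisson(coefficients, xs):
    """Expected counts under the fitted model."""
    b0, b1 = coefficients
    return [math.exp(b0 + b1 * x) for x in xs]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data: one feature value and one case count per week.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cases = [2, 3, 6, 7, 12, 18]
model = fit_poisson(feature, cases)
error = mean_absolute_error(cases, predict_poisson(model, feature))
```

Comparing techniques then comes down to computing `mean_absolute_error` for each model's predictions, separately for each city.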

Tools Used

During this project, we used the following data mining tools.

  • Weka Tool

This tool has functionalities to handle missing data, find correlations, and predict based on provided models. However, it is not easily customizable, so we preferred to use custom implementations.

  • R Language

R is a language commonly used for data mining tasks. We used several libraries, including tidyverse, corrplot, magrittr, zoo, RColorBrewer, gridExtra, and MASS.

Conclusion

In this project, we proposed an analysis scheme built from the techniques that performed best among the alternatives we tried. For cleaning the data, the best option was regression against time. For feature analysis, we used feature values shifted by three weeks, which gave different Pearson correlation values. For normalization, Z-score normalization performed best. Finally, we built the model using eighty percent of the training data, held out the remaining twenty percent as test data, and minimized the mean absolute error through repeated iteration. Of the regression techniques we tried, negative binomial regression gave the best score. Check out this repo for our implementation.
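
The eighty/twenty hold-out mentioned above can be sketched as below. Whether the split was chronological or random is not stated here, so keeping the earliest weeks for training is purely an assumption of this sketch, as is the helper name.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split ordered weekly rows into an earlier training part and a later test part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical weekly rows, already sorted by time.
weeks = list(range(1, 11))
train_rows, test_rows = chronological_split(weeks)
```

A model would be fitted on `train_rows` only, and its mean absolute error measured on `test_rows`.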
