Evansville "Shots Reported" Data -- Part II

Introduction

This is part II of the Evansville “shots reported” series. Open data is important because it lets others look at the data from a fresh perspective and potentially see patterns that were missed. Part of that process is putting the data in new forms so that it yields insights and even predictions. The challenge of the Evansville “shots reported” data set is understanding how many observations are missing and how that affects the analysis.

Missingness

Missing data is a common problem and this data set has all the hallmarks of “real” data: it’s messy. Missing data is typically of two types: one where the gaps are randomly distributed and one where they are not. The R package ggTimeSeries provides a function for a calendar heat map. Once the data were plotted, it was obvious that both types of missingness were present.

The empty days, or “NAs,” are the gray boxes. When included in this calendar heat map, their distribution shows both random and non-random missing data.
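
The two kinds of gaps can also be told apart numerically. The following is a minimal Python sketch, not the R code behind the heat map: it counts missing days per month and finds the longest run of consecutive missing days. The function name `missing_summary`, the year, and the sample counts are all made up for illustration, and the sketch assumes one entry per calendar day with `None` marking a missing day.

```python
from datetime import date, timedelta
from itertools import groupby


def missing_summary(counts):
    """Summarize gaps in a daily series.

    counts: dict mapping date -> int, or None for a missing day.
    Assumes every calendar day in the range has an entry.
    Returns (missing days per (year, month), longest missing run).
    """
    days = sorted(counts)

    # Missing days per calendar month.
    per_month = {}
    for d in days:
        if counts[d] is None:
            key = (d.year, d.month)
            per_month[key] = per_month.get(key, 0) + 1

    # Longest run of consecutive missing days.
    longest = 0
    for is_missing, run in groupby(days, key=lambda d: counts[d] is None):
        if is_missing:
            longest = max(longest, len(list(run)))
    return per_month, longest


# Made-up example (not the Evansville data): Sep 1 through Oct 31.
start = date(2021, 9, 1)
counts = {start + timedelta(days=i): 2 for i in range(61)}
counts[date(2021, 9, 10)] = None            # one isolated gap
for day in range(1, 32):
    counts[date(2021, 10, day)] = None      # a whole month missing

per_month, longest = missing_summary(counts)
print(per_month, longest)
```

A few scattered single-day gaps look like random missingness. A long unbroken run, such as a whole month, suggests something systematic.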

Imputation—Set to Zero

The simplest solution is often the best (Occam’s Razor), so all of the NAs were set to zero. Months where the data were mostly complete showed some variation between red and green, whereas July took on a largely green hue. This seemed unrealistic: July 4 is typically celebrated with fireworks, and one would expect a number of “shots reported” calls on and around that holiday.

The NAs were set to “0”.
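
To see why zero-filling can mislead, here is a small Python sketch with made-up numbers (the helper `fill_zero` and the sample week are hypothetical, not taken from the Evansville data). It compares the mean of the observed days with the mean after every gap becomes a zero.

```python
def fill_zero(values):
    """Replace every missing value (None) with 0."""
    return [0 if v is None else v for v in values]


# Made-up week of daily counts; None marks a missing day.
week = [3, None, 1, None, None, 4, 2]
filled = fill_zero(week)

observed = [v for v in week if v is not None]
mean_observed = sum(observed) / len(observed)   # mean of the 4 recorded days
mean_filled = sum(filled) / len(filled)         # mean once gaps count as 0

print(filled, mean_observed, round(mean_filled, 2))
```

Treating a gap as a true zero pulls the average down. That is the same effect that turned a mostly missing month green on the heat map.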

Imputation— ImputeTS package

The next effort at imputation was to use one of the many R packages built for the job. The imputeTS package had the right name, and it imputes a univariate time series: the only inputs are the observed values and their order in time. With a single variable, the imputation was likely to be simpler. Numerous methods were tried, but none produced the red-green mosaic that would suggest legal holidays were relevant. The closest was the seasonal decomposition (seadec) method, but notice that it turned the last week of June and the first week of July a uniform orange.

evansville_shots_reported_heatmap_na_seadec.jpg
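
The idea behind seasonal-decomposition imputation can be sketched in a few lines of Python. This is a simplified illustration, not the imputeTS implementation. It assumes a weekly cycle, estimates the seasonal part as per-weekday means, fills gaps in the deseasonalized series by linear interpolation, and then adds the seasonal part back. The function name `impute_seasonal` and the sample data are made up, and the sketch assumes at least one observed value.

```python
def impute_seasonal(values, period=7):
    """Fill None entries using a crude seasonal decomposition."""
    n = len(values)
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)

    # 1. Seasonal component: mean of each cycle position minus overall mean.
    seasonal = []
    for p in range(period):
        slot = [values[i] for i in range(p, n, period) if values[i] is not None]
        seasonal.append(sum(slot) / len(slot) - overall if slot else 0.0)

    # 2. Remove the seasonal component from the observed values.
    resid = [None if v is None else v - seasonal[i % period]
             for i, v in enumerate(values)]

    # 3. Linearly interpolate the gaps (edges copy the nearest value).
    known = [i for i, r in enumerate(resid) if r is not None]
    filled = list(resid)
    for i in range(n):
        if resid[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = resid[right]
        elif right is None:
            filled[i] = resid[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = resid[left] + w * (resid[right] - resid[left])

    # 4. Add the seasonal component back.
    return [filled[i] + seasonal[i % period] for i in range(n)]


# Made-up data: four identical weeks with busier weekends, plus some gaps.
pattern = [1, 1, 1, 1, 1, 5, 5]
truth = pattern * 4
series = list(truth)
for i in (5, 6, 12, 13, 20):
    series[i] = None

imputed = impute_seasonal(series)
print([round(v, 2) for v in imputed])
```

A method like this only knows about the repeating cycle and the neighboring days. It has no way to learn that a particular date is a holiday, which may be why it smoothed late June and early July into one color.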

Imputation — MICE Random Forests

The mice package in R (Multivariate Imputation by Chained Equations, or MICE) allows for multivariate imputation. A simple data frame was built that added both weekends and legal holidays as variables. Then the random forest method was used to impute the values. Notice that July 3 and July 4 are both orange, but the surrounding days are green. This is more in line with what experience and common sense would suggest. Notice that it also suggests December 28 as a day where many shots were reported.

evansville_shots_reported_heatmap_na_randomForest.jpg
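
The gain comes from giving the imputer extra variables. The Python sketch below builds the same two flags, weekend and legal holiday, and fills each gap with the mean of observed days that share its flags. That group mean is a deliberately simple stand-in for the random forest, not the mice algorithm. The holiday set, the year, the counts, and the names `features` and `impute_by_group` are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical holiday list for the example (both dates fall on a Monday).
HOLIDAYS = {date(2021, 5, 31), date(2021, 7, 5)}


def features(d):
    """Return (is_weekend, is_holiday) for a date."""
    return (d.weekday() >= 5, d in HOLIDAYS)


def impute_by_group(counts):
    """Fill None with the mean of observed days that share the same flags."""
    groups = {}
    for d, v in counts.items():
        if v is not None:
            groups.setdefault(features(d), []).append(v)

    observed = [v for v in counts.values() if v is not None]
    fallback = sum(observed) / len(observed)   # used if a group is empty

    out = {}
    for d, v in counts.items():
        if v is None:
            g = groups.get(features(d))
            out[d] = sum(g) / len(g) if g else fallback
        else:
            out[d] = v
    return out


# Made-up counts: 2 on weekdays, 4 on weekends, 9 on the observed holiday.
start = date(2021, 5, 24)
counts = {}
for i in range(49):                            # May 24 through July 11
    d = start + timedelta(days=i)
    counts[d] = 4 if d.weekday() >= 5 else 2
counts[date(2021, 5, 31)] = 9
counts[date(2021, 7, 5)] = None                # missing holiday
counts[date(2021, 7, 7)] = None                # missing ordinary weekday

imputed = impute_by_group(counts)
print(imputed[date(2021, 7, 5)], imputed[date(2021, 7, 7)])
```

Because the missing holiday borrows from another holiday rather than from its neighbors, it comes out high while the ordinary weekday stays low. The heat map shows the same kind of contrast around July 4.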

Summary

There’s a lot of data missing in this data set: the full month of October and a lot of July and August. I’m guessing that a dispatch supervisor went on vacation in July and maybe August too. The “shots reported” calls were likely designated as something else, like public disturbance or suspicious circumstances. Imputation programs can be helpful if the data are missing at random and only a small amount is missing. However, that is not the case with this data set. Below is the monthly shots reported data using the imputed values from the MICE package and mid-pointing the October data.

evansville_shots_reported_by_month_imputed.jpg
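
The post does not spell out how the October value was mid-pointed. One plausible reading, and it is only an assumption, is that a fully missing month is given the midpoint of the totals for the months on either side. A minimal Python sketch of that reading, with made-up totals and a hypothetical helper `midpoint_fill`:

```python
def midpoint_fill(monthly, month):
    """Return a copy where `month` is the midpoint of its two neighbors.

    monthly: dict mapping month number -> total (None if missing).
    Assumes both neighboring months are present and not None.
    """
    filled = dict(monthly)
    filled[month] = (monthly[month - 1] + monthly[month + 1]) / 2
    return filled


# Made-up monthly totals (not the Evansville figures).
monthly = {9: 40, 10: None, 11: 30}
filled = midpoint_fill(monthly, 10)
print(filled[10])
```

A midpoint like this is a placeholder that keeps the monthly chart continuous. It says nothing about what actually happened in the missing month.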