Car crashes occur every day in the US, and many of them are fatal. Because so many factors can contribute to a fatal crash, our group wanted to find out which ones matter most. After some searching, we found a dataset suited to that question.
As we pored over the data on 2019 fatal motor vehicle crashes in the US, we were struck by the number of variables in the dataset. The sheer amount of data was interesting but also overwhelming. With so many variables recorded, it was hard at first to identify which ones were most influential in the fatal crashes they described. After completing our data exploration assignment, however, the role of certain variables became much clearer. Those insights helped us narrow our work down to the two questions we explore in this paper.
For our first question, we sought to better understand the circumstances associated with drunk driving, specifically: Can we predict whether or not a driver was drunk based on factors such as the time of day, the day of week, and the number of fatalities? When initially considering this question, we compared a number of other variables to see whether they had any value as predictors. After some investigation, we narrowed the best predictors of whether a driver was drunk down to those named in the question.
The second question we posed about the data is: Can the driver’s blood alcohol content (BAC) and EMS response time be used to predict whether or not the driver died in the crash? In our data exploration, EMS response time and BAC stood out as two important factors in the deadliness of a crash. We therefore decided to examine how effective those two variables would be at predicting whether the driver died.
Our dataset comes from the 2019 Fatality Analysis Reporting System (FARS). The data were collected and compiled by the National Highway Traffic Safety Administration under the U.S. Department of Transportation. The dataset contains over a hundred variables split across multiple spreadsheets, which are matched by case number. It is made up of over 30,000 case files involving fatal motor vehicle accidents. A variety of information about the circumstances of each case is provided; the variables we chose to focus on are the blood alcohol content (BAC) of the driver, the number of fatalities, and timing variables such as time of day and day of the week. Another variable of interest was EMS response time. This variable was not present in the original dataset; we created it from two existing variables, the time EMS was notified and the time EMS arrived on scene. The following table shows each of the variables of interest:
Rows that share the same case number represent different vehicles involved in the same accident. BAC is recorded as a two-digit number from 1 to 94, without the usual decimal point. EMS response time is recorded in minutes.
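As a minimal sketch of how such a derived variable could be computed, the Python snippet below subtracts the notification time from the arrival time and returns the difference in minutes. The function name and the hour and minute arguments are illustrative assumptions, not the dataset's actual column names. The rule that an arrival time earlier than the notification time means the response crossed midnight is also an assumption.

```python
def ems_response_minutes(notify_hour, notify_min, arrive_hour, arrive_min):
    """Minutes between EMS notification and EMS arrival on scene.

    All four arguments are hypothetical clock-time fields (24-hour clock).
    Assumption: if the arrival time appears earlier than the notification
    time, the response crossed midnight, so one day (1440 minutes) is added.
    """
    notified = notify_hour * 60 + notify_min
    arrived = arrive_hour * 60 + arrive_min
    diff = arrived - notified
    if diff < 0:
        diff += 24 * 60
    return diff


# Notified at 14:05, arrived at 14:17.
print(ems_response_minutes(14, 5, 14, 17))
# Notified at 23:50, arrived at 00:08 the next day.
print(ems_response_minutes(23, 50, 0, 8))
```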
The first visualization we created was a bar plot of the proportion of fatal accidents involving a drunk driver by hour of the day. As can be seen in the figure below, the proportion of drunk driving accidents rises clearly during the later hours of the day and into the early morning, peaking at around 3:00 AM. Intuitively, this makes sense, as most alcohol sales and alcohol consumption take place during the late hours of the day and early hours of the morning.
The next plot we created shows the proportion of fatal accidents involving a drunk driver by day of the week. As can be seen, Friday, Saturday, and Sunday show a clear increase compared to other days. This also makes sense, because weekends tend to be when most alcoholic beverages are consumed.
The final plot we created relates the number of fatalities to the proportion of fatal accidents involving alcohol. The proportion increases significantly when there are 6 fatalities, but this may be due to the limited number of observations with 6 fatalities. Nonetheless, a drunk driver may be far more reckless than a sober driver and so cause a larger number of deaths.
The first question we attempted to answer is “Can we predict whether or not a driver was drunk based on factors such as the time of day, the day of week, and the number of fatalities?” We began by visualizing the data to uncover patterns and relationships that could guide our modeling. These preliminary figures revealed clear relationships between drunk driving and each of the factors named in the research question.
With these visualizations in mind, we decided to construct a binomial (logistic regression) model to predict the probability that the driver in an accident was drunk, based on the circumstances of the accident. The predictors were whether the accident occurred at night, whether it occurred over the weekend, and the number of fatalities involved. We first created a “drunk” variable equal to one if the driver was drunk and zero if the driver was sober, and removed any observation missing this information. We then created night and weekend variables in the same way. The weekend variable is one if the day was Friday, Saturday, or Sunday, and zero otherwise. The night variable is one if the lighting at the accident was not daylight and zero if it was. We then split the data into training and testing sets, with the training set containing 67% of the observations chosen at random. Finally, we fitted the model on the training set and found these values.
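The data preparation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record keys `day`, `light`, and `drunk_flag`, the value `"Daylight"`, and the sample records are hypothetical stand-ins, not the dataset's real field names or values.

```python
import random


def add_indicators(record):
    """Return a copy of a crash record with 0/1 indicator variables.

    The keys 'day', 'light' and 'drunk_flag' are hypothetical names
    used only for this sketch.
    """
    out = dict(record)
    out["weekend"] = 1 if record["day"] in {"Friday", "Saturday", "Sunday"} else 0
    out["night"] = 0 if record["light"] == "Daylight" else 1
    out["drunk"] = 1 if record["drunk_flag"] else 0
    return out


def train_test_split(rows, train_frac=0.67, seed=0):
    """Shuffle the rows and split them into (training, testing) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(train_frac * len(rows))
    return rows[:cut], rows[cut:]


# Made-up records; a drunk_flag of None means the information is missing.
records = [
    {"day": "Saturday", "light": "Dark", "drunk_flag": True, "fatals": 2},
    {"day": "Tuesday", "light": "Daylight", "drunk_flag": False, "fatals": 1},
    {"day": "Friday", "light": "Dusk", "drunk_flag": None, "fatals": 1},
]

# Drop observations with no drunk information, then add the indicators.
cleaned = [add_indicators(r) for r in records if r["drunk_flag"] is not None]
train, test = train_test_split(cleaned)
print(len(cleaned), len(train), len(test))
```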
We then used the fitted regression equation to compute the linear predictor, eta, for each observation in the testing set and converted it to a probability with the log odds formula.
Above is a histogram displaying the distribution of the probabilities. Accidents with a high p happened at night, happened over the weekend, and involved multiple deaths. Accidents with a low p had the opposite characteristics. Intuitively this makes sense. We then created predictions on the testing set based on the model and found the optimal cutoff point for p to be 0.4220516, based on the optimal cutoff function. If p was above 0.4220516, the prediction variable was set to one. If p was below, it was set to zero.
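To make the step from eta to a probability concrete, here is a small Python sketch. The linear predictor is eta = b0 + b1·night + b2·weekend + b3·fatalities, and the log odds formula inverts to p = 1 / (1 + exp(−eta)). The coefficient values below are placeholders chosen purely for illustration; they are not the values fitted in this analysis.

```python
import math


def predicted_probability(night, weekend, fatalities, coef):
    """Probability that the driver was drunk under a logistic model.

    eta is the linear predictor (the log odds); the logistic function
    maps it to a probability between 0 and 1.
    """
    eta = (coef["intercept"]
           + coef["night"] * night
           + coef["weekend"] * weekend
           + coef["fatalities"] * fatalities)
    return 1.0 / (1.0 + math.exp(-eta))


# Placeholder coefficients for illustration only (NOT the fitted values).
example_coef = {"intercept": -2.0, "night": 1.0, "weekend": 0.5, "fatalities": 0.2}

# A weekend night crash with two deaths versus a weekday daytime crash with one.
print(predicted_probability(1, 1, 2, example_coef))
print(predicted_probability(0, 0, 1, example_coef))
```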
Using the prediction variable, we created a confusion matrix on the testing set to test the model. We found the model to be accurate 75.58% of the time.
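The cutoff and confusion-matrix steps can be sketched as follows. The probabilities and true labels are toy values invented for the example; only the cutoff 0.4220516 comes from the analysis above.

```python
def classify(probabilities, cutoff):
    """Label an observation 1 when its probability is above the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of observations whose predicted label matches the true label."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Toy probabilities and true labels, invented for illustration.
toy_probs = [0.9, 0.2, 0.6, 0.3, 0.5]
toy_actual = [1, 0, 0, 0, 1]

toy_pred = classify(toy_probs, cutoff=0.4220516)
toy_counts = confusion_matrix(toy_actual, toy_pred)
print(toy_counts, accuracy(toy_counts))
```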
We visualized the model by using a ROC plot and found the model to perform well above the chance line.
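For readers unfamiliar with ROC plots, the sketch below shows one way the points of such a curve, and the area under it, could be computed. The chance line is the diagonal, which has an area of 0.5. The scores and labels are toy values invented for the example, and the function names are our own.

```python
def roc_points(actual, scores):
    """(false positive rate, true positive rate) pairs, one per threshold.

    Assumes the 0/1 labels contain at least one positive and one negative.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (0.5 corresponds to chance)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and scores, invented for illustration.
toy_labels = [1, 0, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.3, 0.5]

curve = roc_points(toy_labels, toy_scores)
print(curve)
print(area_under_curve(curve))
```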
Overall, the model performs well at predicting whether or not a driver was drunk based on these factors. Many drunk driving accidents occur under similar circumstances, which makes them more predictable.
The second question we attempted to answer was “Can the driver’s BAC and the EMS response time be used to determine the fatality of the driver?” Before employing any statistical methods, we first examined the relationship between BAC and driver fatality, and between EMS response time and driver fatality. The boxplots below show that higher BAC and longer EMS response times tend to be associated with driver fatalities:
The first model we used to predict whether an accident resulted in the death of the driver was a k-NN model. The goal of this model was to use BAC and EMS response time to determine whether or not the driver died. The plot below shows all of the available data: EMS response time versus BAC, with each point marked by whether the driver died.
In order to have a more accurate prediction, these values were standardized:
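Standardizing matters for k-NN because the method relies on distances, and BAC and response time in minutes are on very different scales. A minimal sketch of the usual z-score transformation is shown below; the input values are made up, and the use of the sample standard deviation is an assumption about how the standardization was done.

```python
import statistics


def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Made-up response times in minutes.
print(standardize([2, 4, 6]))
```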
We then used 80% of this data to train the model, and 20% to test the model. In order to use k-NN, we had to determine the best value of k to use. We did this by training the model with certain values of k, and then determining the accuracy of the predictions by using the model on the test dataset. After looking at values up to 125 for k, we looked at values between 125 and 320 to find the most accurate k. In the plot below, it is evident that the best k value was 205, with an accuracy of around 61%.
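The search for k can be sketched in plain Python as below: each test point is classified by a majority vote among its k nearest training points, and the candidate k with the highest test accuracy is kept. The data are toy standardized (BAC, response time) pairs invented for the example, and the function names are our own.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority label among the k training points nearest to `point`."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: (train_x[i][0] - point[0]) ** 2
        + (train_x[i][1] - point[1]) ** 2,
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_x, train_y, p, k) == y for p, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def best_k(train_x, train_y, test_x, test_y, candidates):
    """Candidate k with the highest test accuracy (first one wins ties)."""
    return max(
        candidates, key=lambda k: knn_accuracy(train_x, train_y, test_x, test_y, k)
    )


# Toy standardized (BAC, response time) pairs; 1 = fatal, 0 = non-fatal.
toy_train_x = [(-1.0, -1.0), (-1.2, -0.8), (-0.9, -1.1),
               (1.0, 1.0), (1.1, 0.9), (0.8, 1.2)]
toy_train_y = [0, 0, 0, 1, 1, 1]
toy_test_x = [(-1.0, -0.9), (1.0, 1.1)]
toy_test_y = [0, 1]

print(best_k(toy_train_x, toy_train_y, toy_test_x, toy_test_y, [1, 3, 5]))
```

Odd values of k avoid tied votes when there are only two classes. Choosing k on the same test set that is used to report accuracy can make the reported figure somewhat optimistic; a separate validation split is a common alternative.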
Therefore, by using the k value of 205 and the k-NN model, we can predict the driver fatality correctly 61% of the time by using the BAC and EMS Response Time.
The next three models we considered for this problem were a linear model to predict fatality based on standardized BAC and standardized EMS Response Time, and two linear models to determine whether each variable could be used on its own to predict fatality. In order to use linear models, we had to convert fatality to a numerical variable: Fatal is represented by 1, and Non-Fatal is represented by 0. The model using both variables had an accuracy of 58.4%, which was slightly worse than the k-NN model. Based on the coefficients, BAC is weighted slightly more than Response_Time in determining the fatality of the driver.
The model using only BAC to determine fatality was accurate only 53.3% of the time, and the model using only EMS Response Time was accurate only 56.2% of the time. Therefore, the best linear model incorporates both BAC and EMS Response Time to predict fatality.
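A linear model fit to a 0/1 outcome can be sketched as below: the coefficients are found by least squares, and a fitted value is turned into a class label by comparing it to a threshold. The threshold of 0.5 is an assumption made for this sketch, since the rule used to turn fitted values into predictions is not stated here. The data are toy values and the function names are our own.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [(1.0, x1, x2) for x1, x2 in xs]
    n = 3
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict_class(coef, x1, x2, threshold=0.5):
    """Label 1 (fatal) when the fitted value reaches the assumed threshold."""
    fitted = coef[0] + coef[1] * x1 + coef[2] * x2
    return 1 if fitted >= threshold else 0


# Toy standardized (BAC, response time) pairs; 1 = fatal, 0 = non-fatal.
toy_x = [(-1.0, -1.0), (-0.5, -1.0), (1.0, 0.5), (1.0, 1.0), (0.0, 0.2)]
toy_y = [0, 0, 1, 1, 1]

toy_coef = fit_linear(toy_x, toy_y)
print(toy_coef)
print([predict_class(toy_coef, x1, x2) for x1, x2 in toy_x])
```

Comparing the magnitudes of the two slope coefficients is meaningful here only because both predictors were standardized to the same scale.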
In our first question, we tried to see whether we could use time of day, day of the week, and number of fatalities in a crash to predict whether the driver was drunk. We fitted a binomial model to produce our prediction variable and used a confusion matrix to test it. We found that we could predict whether the driver was intoxicated from these factors with an accuracy of 75.58%. This outcome exceeded our initial expectations for these predictors, and our team was quite pleased with the results. In our second question, we sought to understand whether driver BAC and EMS response time could be used to predict whether the driver in the crash died. Using k-NN modeling, we found that the two variables together predicted the fatality of the driver with a 61% accuracy rate. We were also able to measure how well each variable alone predicted the fatality of the driver, and found that EMS response time had a 56.2% accuracy rate and driver BAC had a 53.3% accuracy rate. Even though the individual accuracy of BAC and of EMS response time is relatively low, both contribute, alongside our other results, important and relevant information for the real world.
Every year, many people experience car accidents ranging from minor fender benders to serious collisions. The worst are those in which people die, which are the accidents recorded in this dataset. Although cars have been around for a long time, the question of how best to address accidents, especially fatal ones, is still very much relevant. Findings like ours can help guide effective legislation and policing. For instance, in the case of our first question, certain times of day and days of the week, along with the number of fatalities, are associated with a greater likelihood of a drunk driver. Using this information, police could know when to be extra vigilant about drunk drivers on the road. In the case of our second question, boxplots showed that both higher driver BAC and longer EMS response times are associated with crashes in which the driver dies. Focusing on EMS response time, legislators could make infrastructure decisions aimed at faster EMS response. Beyond those two examples, there are many other ways in which this data can inform decisions aimed at preventing fatal crashes.
As a whole, the dataset our team worked with was very solid. It provided many variables that we were not able to fully use, as doing so would have required a much larger analysis. Despite the quality of this dataset, additional data would have helped. In particular, data on non-fatal crashes could help remove possible biases in our results and provide a more well-rounded picture of crash trends in the United States. Such data could allow for deeper insights. For our first question, it would add zero as a possible value for the number of deaths in a crash, which in turn could make our models more accurate. For our second question, we could broaden it to ask whether there was any death at all in the crash, as opposed to just the driver's, since we would be able to account for crashes with no deaths. Beyond adding more data, we could also have broken the data down further by geography. Since the dataset indicates the state in which each crash occurred, we could have applied these same questions to each state. Such results could have shown how crash trends differ by state.
In total, this dataset yielded valuable insights into a very real issue. Deadly crashes in the United States continue to be a problem that our society must work to fix. It is through datasets such as this one that we can find ways to make the roads a bit safer for everyone.