INTRODUCTION

Coronavirus disease 2019 (COVID-19) is a viral infection that causes severe respiratory illness and has resulted in over 1 million deaths worldwide since its outbreak. In the U.S., from late January until the present, both the total number of confirmed cases and the total number of deaths have grown exponentially. As of early October 2020, the total number of confirmed cases in the U.S. reached 7.5 million, with the total number of recovered cases hitting 4.3 million. In comparison to China, where the spread of the virus has slowed and entered a steady state, the United States continues to experience exponential growth in both COVID-19 cases and deaths. We are interested in exploring potential causal factors that contribute to and exacerbate COVID-19 spread and mortality. Rather than taking the mortality rate directly as our dependent variable, we use the case fatality ratio, which gives a more precise measure of the disease’s lethality by reporting the number of deaths divided by the number of confirmed cases in an area. Because COVID-19 infection rates and case fatality ratios vary substantially across the United States, we performed our analysis on individual counties, allowing us to compare case fatality ratio to county-level metrics. This analysis will provide valuable information to policy makers interested in learning which factors are predictive of high rates of COVID-19 infection and mortality.

Besides exploring the factors behind the COVID-19 case fatality ratio, we were interested in understanding the disease’s economic impact across the United States. Specifically, we wanted to evaluate factors that might determine which states are more likely to suffer severe economic harm as a result of COVID-19. Between Q1 2020 and Q2 2020, some industries experienced a more dramatic drop in percent contribution to GDP than others. The Energy and Metal sectors, for example, lost 15% of their overall contribution to GDP. One might therefore expect states that are highly dependent on these sectors, such as Wyoming and Texas, to suffer a severe economic impact due to COVID-19. As with changes in GDP, most economic metrics directly reflect the strength and performance of a region’s core industries. We are therefore interested in evaluating COVID-19’s impact on different industries in different regions of the U.S. Such an evaluation would show explicitly which industry in a given region has been most severely affected by the virus. From a policy standpoint, these results could help local governments adopt policies that prop up industries harmed disproportionately by the virus. Additionally, these results could help guide the Federal Reserve in issuing a stimulus package for the U.S. economy, ameliorating the catastrophic impact of COVID-19 on the country’s economic development in the post-COVID stage.

Our analysis addresses two primary questions: first, “What are some of the determining factors that are responsible for COVID-19’s case fatality ratio?”; second, “What factors are responsible for state-level differences in the economic impact of COVID-19?” These questions merit investigation because the insights from each could inform a county, a state, a region, or even a country about specific factors that may influence the lethality of COVID-19, as well as the industries most in need of economic help to avoid collapse.

DATA

We downloaded our main COVID-19 datasets from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) data repository using the covid19.analytics R package. The datasets are continuously updated to provide the latest information on COVID-19. All of the datasets that we used are censuses, rather than samples from a larger population. Each observation (row) represents results for a specific geographical region (e.g., a U.S. county). The time-series (ts) datasets we used in our analysis report the number of confirmed cases, deaths, and recovered cases, allowing us to relate state-level and county-level COVID-19 statistics to other state-level and county-level variables. The ts-confirmed dataset is a time-series table of confirmed COVID-19 cases across the world, separated by country and region. This dataset includes 269 observations as of 2020-11-17. Fields include Province.State, which lists the province or state; Country.Region, which lists the country or region; Latitude and Longitude; and a set of date columns. Similarly, the ts-deaths dataset is a time-series table of confirmed COVID-19 deaths across the world, which includes 269 observations as of 2020-11-17 with the same variables. The ts-recovered dataset is a time-series table of confirmed COVID-19 recovered cases across the world, separated by country and region. This dataset includes 256 observations as of 2020-11-17 with the same variables as the other ts datasets.

Our first question specifically investigates the case fatality ratio. We used the aggregated dataset, a non-time-series table displaying confirmed cases, deaths, recovered cases, active cases, incident rate, and case fatality ratio. For our analysis, we only used the fields Province_State, which reports the U.S. state; Admin2, which reports the U.S. county; and the case fatality ratio field, which reports the ratio of fatal cases to confirmed cases. The following table displays the most important variables that we used in our analysis.

To further explore the explanatory variables that could account for our dependent variable, case fatality ratio, we consulted the county complete dataset downloaded from the University of Oxford COVID-19 Research Center. The dataset groups potential causal factors into categories, including age groups (e.g., age_under_5_2017), educational attainment (e.g., bachelors_2017), and racial groups (e.g., white_2017). In our EDA, we fit multiple simple linear regression models, using some of the variables in each representative category to try to explain their effect on the case fatality ratio. To address our first question, namely which combination of determining factors yields high predictive power, we used the dataset twice: first, we used all potential causal factors to select our final variables of interest; second, using the same suite of independent variables, we applied an alternative regression technique, random forest. The county complete dataset has 3142 observations collected at the county level and can be considered a comprehensive census for analyzing variables of interest across many fields. However, the data in this dataset date back to 2017, so we remained concerned about inconsistencies due to this temporal mismatch. Moreover, in assessing factors that may influence case fatality ratio, we obtained additional county-level data from outside sources. Specifically, we imported the hospital beds USA dataset obtained from the COVID Tracking Project, an authoritative organization that collects datasets from the CDC at the state level and from health organizations at the county level. The hospital beds USA dataset comprises 5713 observations, with each observation representing a county within a given state.
In particular, to investigate the relationship between hospital capacity metrics and case fatality ratio in our first question, we were interested in the hospital bed counts and the population recorded at the beginning of 2020 in this dataset. The dataset also indicates the type of each bed (“ICU”, “Acute”, “Psychiatric”, or “Other”), which adds precision to the data.

Our second question specifically investigates state-level differences in the economic impact of COVID-19. We included three datasets collected by the U.S. Bureau of Economic Analysis: first, the State Change GDP dataset shows how strongly COVID-19 has affected each state since its outbreak in the U.S., measured by the percent change in GDP since March 2020; second, the GDP by Region and Industry dataset assesses impacts at the regional level while also illustrating industry performance during the pandemic; lastly, the Percent_GDP_by_State dataset shows, at a much more granular level, the composition of industries that make up each state’s GDP. The State Change GDP dataset is a table displaying state-level GDP between Q1 2019 and Q2 2020, from which we calculated percent change in GDP between Q2 2019 and Q2 2020. This dataset has 51 observations pertaining to the 50 U.S. states plus Washington D.C. and represents a full census of state-level GDP data for the United States. Below is a heat map displaying the extent of COVID-19’s influence on each state in terms of GDP changes.

To further examine changes in contributions to GDP across industries and U.S. regions, we incorporated the GDP by Region and Industry dataset into our analysis. This dataset shows contributions to percent change in GDP by U.S. region from Q1 2020 to Q2 2020. It reports summary values for various industries across the United States and thus is not a sample drawn from a larger population. The dataset contains 8 observations, each pertaining to a geographic region within the United States. Industries, including Agriculture, Manufacturing, and Retail, are reported as our variables of interest. Below is a descriptive figure that displays the contributions to GDP changes by industry and by region.

Our last dataset, Percent_GDP_by_State, decomposes state-level GDP into different industries. It is a table comprising 20 observations, with each observation representing a different industry. All columns are variables of interest, since we address our second question at the state level.

RESULTS

Question 1: What are some of the determining factors that are responsible for COVID-19’s case fatality ratio?

In our EDA, we initially selected our variables of interest from the following major categories: age class, racial makeup, and educational attainment. After running simple linear regression models using variables of interest from each of these categories, we concluded from the low R-squared values of all models that the simple linear regression models lacked predictive power. We then joined the U.S. aggregated dataset to county_complete, which contains demographic information for U.S. counties. The resultant dataframe reported both COVID-19 data (confirmed cases, deaths, case fatality ratio, etc.) and demographic data such as racial makeup, average age, and poverty rate. We subset this dataframe to include only those independent variables that we thought would be relevant to predicting COVID-19 case fatality ratio. To avoid problems of multicollinearity, we plotted a correlogram showing pairwise correlation coefficients for all independent variables. Below is the correlogram incorporating almost 80% of the potential causal factors. Next, we performed variable selection using all subsets, backward selection, forward selection, and stepwise selection methods. The results from the all subsets model selection ranked the variables of interest from high to low explanatory power: age_under_5_2017, median age, black, native, hs_grad_2017, median household income, and persons_per_household_2017. The full model, with all 5 predictor variables, generated an R-squared value of 18.34%. In assessing the variables with high explanatory power, we again detected the multicollinearity issue discussed earlier. Therefore, we examined each category and selected the variable that was least correlated with the remaining variables. Similarly, backward selection converged on the result that a predictive model with five explanatory variables yields low RMSE, MAE, and RMSESD.
Eventually, this process yielded the following variables in our final model: age_under_5_2017, black_2017, native_2017, hs_grad_2017, persons_per_household_2017.

To assess the fit of our model, we used a 10-fold cross validation on the final model we created previously, using 80% of our data as training data and 20% as testing data. After each fold, we calculated the mean absolute error (MAE) on the testing dataset. The mean MAE value for the 10-fold cross validation was 0.01239817. Additionally, after each fold we calculated the residuals for the testing data and exported these as a numeric vector. We created a frequency histogram and density plot to check whether the residuals were approximately normally distributed with a mean of zero.
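The validation loop described above can be sketched in a few lines. The analysis itself was run in R; the Python below is only an illustrative sketch using the standard library. It reads the 80%/20% description as ten repeated random hold-out splits (a strict 10-fold scheme would instead hold out 10% per fold), and the function names fit_ols, predict, and repeated_holdout_mae are hypothetical, not taken from the original analysis.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented system [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(beta, x):
    """Predicted response for one row of predictors."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def repeated_holdout_mae(X, y, repeats=10, train_frac=0.8, seed=0):
    """Mean of the test-set MAE over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    maes = []
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        errors = [abs(y[i] - predict(beta, X[i])) for i in test]
        maes.append(sum(errors) / len(errors))
    return sum(maes) / len(maes)

# Synthetic stand-in data (not the county data): two predictors plus noise
rng = random.Random(42)
X_demo = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y_demo = [0.02 + 0.001 * a - 0.002 * b + rng.gauss(0, 0.005) for a, b in X_demo]
mean_mae = repeated_holdout_mae(X_demo, y_demo)
```

Averaging the per-split MAE gives a single figure comparable across models, which is how the linear and random forest results are compared in this section.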

To compare the concordance between alternate regression techniques, we applied a random forest regression, using the same 5 variables that were selected for our final model in an earlier step. Random Forests is a machine learning algorithm developed by Breiman (2001) and used for regression and classification. It is a decision-tree based method that trains each tree on a bootstrap sample of the data and considers only a random subset of predictors at each split. A “forest” of trees is grown and the mean of the trees’ predictions is used as the predicted outcome. For the variable importance plots, the values of one predictor at a time are randomly permuted while the other predictors are left unchanged, and the resulting percentage change in MSE is recorded; the increase in node purity measures how much splits on that predictor reduce the residual sum of squares. We used the R package randomForest to run a random forest regression, using case fatality ratio as the dependent variable and all numerical predictors as independent variables. We grew 500 trees and used the package default for the number of candidate variables tried at each split (for regression, randomForest defaults to p/3 rather than the sqrt(p) it uses for classification, where p is the number of predictors). To assess variable importance, we made a variable importance plot using the function varImpPlot, plotting variable importance according to both % Increase in MSE and Increase in Node Purity. The first chart below is the variable importance plot based on all of our potential causal factors.
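The idea behind the MSE-based importance measure can be illustrated with a small permutation sketch. This is illustrative Python, not the randomForest implementation: the R package computes the measure per tree on out-of-bag samples and reports a normalized value, whereas this sketch shuffles one column of a single dataset and reports the raw increase in MSE. The toy model and data are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of a prediction function over a dataset."""
    return sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, column, seed=0):
    """Increase in MSE after randomly shuffling one predictor column."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original one
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    return mse(model, X_perm, y) - mse(model, X, y)

# Hypothetical fitted model that depends on predictor 0 and ignores predictor 1
toy_model = lambda x: 3.0 * x[0]
X_demo = [[float(i), float((i * 7) % 10)] for i in range(20)]
y_demo = [3.0 * row[0] for row in X_demo]

importance_used = permutation_importance(toy_model, X_demo, y_demo, column=0)
importance_ignored = permutation_importance(toy_model, X_demo, y_demo, column=1)
```

A predictor the model relies on produces a large increase in error when shuffled, while a predictor the model ignores produces none, which is the ordering a variable importance plot displays.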

As with the linear model, we used a 10-fold cross validation, where 80% of our data was used to train the model and 20% was used to test the model. After each fold, we calculated the mean absolute error (MAE) on the testing dataset. In the cross-validated random forest model, the mean MAE value for the 10-fold cross validation was 0.01195932. Additionally, after each fold we calculated the residuals for the testing data and exported these as a numeric vector. We created a frequency histogram and density plot to check whether the residuals were approximately normally distributed with a mean of zero. Because we included hs_grad_2017 in our final model, we then created a partial dependence plot for hs_grad_2017 to assess the nature of the relationship between this variable and case fatality ratio. Partial dependence plots work by fixing all variables other than the variable of interest at their mean value. Next, the algorithm iterates through a range of values of the variable of interest (in this case hs_grad_2017) and records the effect on the predicted outcome. The resultant plot can be used to determine whether the variable of interest has a positive or negative effect on the response variable.
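The partial dependence procedure as described above can be sketched as follows. This is illustrative Python rather than the R code used in the analysis, and it follows the description in the text (other predictors held at their means); many implementations instead average predictions over the observed rows. The function name, the toy model, and the data values are hypothetical.

```python
def partial_dependence_at_mean(model, X, column, grid):
    """Predicted outcome along a grid for one predictor, others held at their means."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    curve = []
    for value in grid:
        point = list(means)      # start from the mean of every predictor
        point[column] = value    # vary only the predictor of interest
        curve.append((value, model(point)))
    return curve

# Hypothetical fitted model: the outcome falls as predictor 0 rises
toy_model = lambda x: 0.05 - 0.0004 * x[0] + 0.001 * x[1]
# Hypothetical rows of [percent high school graduates, persons per household]
X_demo = [[70.0, 2.4], [80.0, 2.6], [90.0, 2.5]]
grid = [60.0, 70.0, 80.0, 90.0, 100.0]
curve = partial_dependence_at_mean(toy_model, X_demo, column=0, grid=grid)
predictions = [p for _, p in curve]
```

A curve that falls from left to right indicates a negative relationship between the predictor and the predicted outcome, and a rising curve indicates a positive one.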

After comparing MAE and residual plots between the random forest model and the linear model, we concluded that the predictive power of these techniques is remarkably similar for this dataset. We concluded that the 5 variables we selected as determinants of case fatality ratio produce similar results regardless of the regression algorithm used. To visualize the relationships between our independent variables and case fatality ratio, we calculated 95% confidence intervals for the coefficients of all independent variables and plotted these using a dot and whisker plot. Typically, a 95% confidence interval that is entirely positive indicates a variable that has a positive effect on the dependent variable, whereas one that is entirely negative indicates a negative effect. A variable whose 95% confidence interval overlaps zero typically cannot be said to have either a positive or negative effect on the dependent variable.

Question 2: What factors are responsible for state-level differences in the economic impact of COVID-19?

We obtained state-level GDP data for Q1 2020 and Q2 2020 and calculated the percent change in GDP between these two quarters. Next, we obtained data on the relative percent of each state’s GDP by industry using 2016 data. We attempted to characterize the diversification of a state’s economy, or lack thereof, by reporting the maximum percent contribution to GDP due to a single industry. States with high values were highly dependent on a single industry and thus had low diversification. We also calculated case fatality ratio from the aggregated dataset by dividing the number of confirmed deaths by the number of confirmed cases and multiplying the result by 100. We created 3 competing linear models to explain variation in state-level changes in GDP. Model 1 used only case fatality ratio as its explanatory variable. Model 2 used only “max_per”, the maximum percentage of state GDP derived from a single industry. Model 3 used percent composition for all industries in a given state. We compared the models using an ANOVA and selected as the top model the one with the highest adjusted R-squared value. Next, we used the predict() function to predict change in GDP for all states, and we compared actual and predicted values visually using a scatterplot.
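The derived quantities in this section can be sketched as follows. This is illustrative Python (the analysis itself was run in R); the function names and every number are hypothetical and are not taken from the BEA or JHU data. The adjusted R-squared uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors.

```python
def percent_change(old, new):
    """Percent change from an earlier value to a later one."""
    return (new - old) * 100 / old

def max_industry_share(shares):
    """Return (industry, share) for the industry with the largest share of GDP."""
    industry = max(shares, key=shares.get)
    return industry, shares[industry]

def case_fatality_ratio_percent(deaths, confirmed):
    """Deaths per 100 confirmed cases; None when there are no confirmed cases."""
    return None if confirmed <= 0 else deaths * 100 / confirmed

def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared, penalizing models that use more predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical state record, for illustration only
state = {
    "gdp_q1": 200.0,
    "gdp_q2": 180.0,
    "shares": {"Energy": 35.0, "Retail": 20.0, "Manufacturing": 15.0},
    "deaths": 50,
    "confirmed": 2000,
}
gdp_change = percent_change(state["gdp_q1"], state["gdp_q2"])
top_industry, max_per = max_industry_share(state["shares"])
cfr = case_fatality_ratio_percent(state["deaths"], state["confirmed"])
```

Because adjusted R-squared penalizes each additional predictor, it allows a fairer comparison between a one-variable model such as Model 1 or Model 2 and a model with many industry shares such as Model 3.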

CONCLUSION

The results of the first follow-up question indicate that the explanatory variables “the percentage of people aged under 5”, “% of population comprised of black racial group”, “% of population comprised of native racial group”, “% of population who earned high-school degree”, and “the number of persons per household” have the highest predictive power for the COVID-19 case fatality ratio. From the 95% confidence intervals, we found that a younger population and larger black and native racial groups were positively correlated with the case fatality ratio. In contrast, higher educational attainment, defined as the percentage of the population with a high school diploma, was negatively correlated with case fatality ratio. The only counter-intuitive result is the negative correlation between the number of persons per household and case fatality ratio. These results are valuable to both local and national disease control agencies. For example, the CDC could use the results from our first question to deliver more effective guidelines for controlling the nation’s case fatality ratio, first by identifying the states or counties with the highest percentage of black and native population, and then by enforcing more effective quarantine or COVID-19 testing policies in those states and counties. In regions with high educational attainment, on the other hand, states should encourage highly educated people to share scientific disease control methods within their households. As for the last factor in the final model, household size, agencies should be mindful of counties where households are relatively small, since members of small households may be less likely to seek the necessary COVID-19 testing if their effect on the county’s case fatality ratio seems minimal.

Future researchers may wish to follow up on our analysis by digging deeper into the factors that make states and counties more or less susceptible to COVID-19 outbreaks. We were unable to obtain high-quality information on the prevalence and efficacy of measures intended to combat the spread of COVID-19, yet it is well known that measures such as mask mandates and lockdowns vary widely by jurisdiction. A detailed time-series dataset with start and end dates for each COVID mitigation mandate would provide valuable information on the types of measures most likely to stop the spread of this virus.

In addition, the results for the second follow-up question indicate that states with greater dependence on a single industry are more likely to suffer economically as a result of COVID-19. From Model 3 we conclude that the industry makeup of a state’s economy is a good indicator of the severity of the changes in GDP resulting from COVID-19. This question has real-world implications for economies suffering under the impacts of COVID-19. As countries recover from COVID-19 and enter a post-COVID economy, they will have to make difficult decisions about how to allocate resources to prop up struggling economies. Our results show that the industry makeup of an individual state is strongly predictive of the economic impact felt by that state due to COVID-19. Governments may use these results to decide how to allocate resources to different industry sectors to help their economies recover. Our second analysis focused on state-level changes to GDP and the industries that contribute to GDP in each of these states. Future research should attempt to understand these trends at a higher resolution. If future researchers can relate changes in GDP to industry at the county level, it would help policy makers understand where and in which sectors the most devastating effects of COVID-19 are being felt. Such an analysis would require county-level GDP and industry data, rather than the state-level data that we used in our analysis.

Moving forward, more research is needed in order to uncover the relationship between county level predictors and case fatality ratio. Certain demographic factors, such as the percentage of black or Native American people in a county, are highly correlated with case fatality ratio. However, it is beyond the scope of our analysis to ask why these factors contribute to case fatality ratio. Future researchers could attempt to understand why these populations suffer a disproportionate mortality rate with COVID-19.