Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.
Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.
Submission: Submit via Canvas. Must be submitted as an HTML file generated in RStudio.
Does high cholesterol lead to high crime rates? Probably not. This
data analysis assignment is separated into two parts which cover
material from the lectures on tidy data and joins. In Part 1, you will
demonstrate the basic concept of joins by connecting relational data
involving a cholesterol study. For this segment,
pivot_longer and pivot_wider will be utilized
to create a single tidy dataset ready for analysis. In Part 2, we will
join 5 datasets.
# Data for Part 1
CHOL1=read_csv("Cholesterol.csv")
CHOL2=read_csv("Cholesterol2.csv")
# Data for Part 2
VIOLENT=read_csv("FINAL_VIOLENT.csv")
ZIP=read_csv("FINAL_ZIP.csv")
STATE_ABBREV=read_csv("FINAL_STATE_ABBREV.csv")
CENSUS=read_csv("FINAL_CENSUS.csv")
S_VS_D=read_csv("FINAL_SAFE_VS_DANGEROUS.CSV")The data frame CHOL1 contains experimental results from
randomly assigning 18 people to one of two competing margarine
brands “A” and “B”. The cholesterol of these patients was measured once
before using the margarine brand, once after 4 weeks with the
margarine brand, and then again after 8 weeks with the
margarine brand. Researchers want to see if there is benefit of these
brands of margarine on reducing an individual’s cholesterol and want to
determine if there is a statistically significant difference between the
two competing brands.
Start by examing the tables CHOL1 and CHOL2
and answering the following questions with Yes or No
responses.
Is the variable ID in CHOL1 a primary
key?
Answer (1 Point): ANSWER_HERE
Is the variable, Margarine in CHOL1 a
primary key?
Answer (1 Point): ANSWER_HERE
Is the variable, Brand in CHOL2 a primary
key?
Answer (1 Point): ANSWER_HERE
In a new data frame called CHOL1a based on
CHOL1, rename the variables After4weeks and
After8weeks to nonsynctactic variable names 4
and 8, respectively. Use names(CHOL1a) to
display this modification.
#Use the pivot_longer() function or gather()
function on CHOL1a to create a new numeric variable called
Week that contains numeric values 4 or 8
and a new numeric variable called Response that contains
the Cholesterol after the corresponding number of weeks. Create a new
data frame called CHOL1b with these modifications and use
str(Chol1b) to show that both variables have been created
correctly and are indeed numeric (an integer variable is a specific type
of numeric variable).
#Now working with CHOL2, we want to spread the variable
Statistic across multiple columns. Do this in a new data
frame called CHOL2a and use print(CHOL2a) to
display the modified complete table.
#Start by examing the tables CHOL1b and
CHOL2a and answering the following questions with
Yes or No responses.
Is the variable ID in CHOL1b a primary
key?
Answer (1 Point): ANSWER_HERE
Is the variable, Margarine in CHOL1b a
primary key?
Answer (1 Point): ANSWER_HERE
Is the variable, Brand in CHOL2a a primary
key?
Answer (1 Point): ANSWER_HERE
Get the nutritional facts of the different margarine brands in
CHOL2a into the experimental results found in
CHOL1b using a join. Create a new data frame named
CHOL.COMBINED and display the table using
head(CHOL.COMBINED). This final data frame should contain
36 observations and 10 variables.
#In the zipped folder, there are 5 CSV files. In this
section, we are going to merge all of that data into one object called
FINAL.VIOLENT.
The dataset S_VS_D contains a variable
CLASS where “S=Safe” and “D=Dangerous” according to the
article These
Are the 2018 Safest and Most Dangerous States in the U.S by Steve
Karantzoulidis. We seek to compare the violent crime statistics for
states not in this list. Use a filtering join to create a new data frame
called VIOLENT2 that only contains violent crime statistics
from the states not represented in the data frame S_VS_D.
Use str(VIOLENT2) to display the variables and the
dimensions of VIOLENT2.
#Start by creating a new data set called VIOLENT3 based
on VIOLENT2 that fixes some problems in the variable
City. Specifically, we would like to change “Louisville
Metro” to “Louisville”. Next, create a new data frame named
VIOLENT4 that connects the population change and density
measures from 2019 contained in CENSUS to the cities and
states in VIOLENT3. Use head(VIOLENT4) to give
a preview of the new merged dataset.
Finally, in a complete sentence, identify any location(s) (Cities and States) missing violent crime information.
Code and Output (4 Points):
#Answer (4 Points): ANSWER_HERE
Either ambitiously using one step or less-ambitiously using multiple
steps add the longitude and latitude information provided in
ZIP to the cities and states in VIOLENT4. You
will need to use STATE_ABBREV data to link these two data
frames. Your final data frame named FINAL.VIOLENT should
contain all of the information in VIOLENT4 along with the
variables lat and lon from ZIP.
There should be no state abbreviations in
FINAL.VIOLENT since this information is redundant. Use
str(FINAL.VIOLENT) to demonstrate that everything worked as
planned.
In FINAL.VIOLENT identify what cities are missing
latitude and longitude. Closely, inspect both the ZIP and
VIOLENT4 data frames. Report the location(s) missing
geographical information and explain in complete sentences why this
happened.
Finally, challenge yourself and attempt to fix this problem in a new
data frame called FINAL.VIOLENT.FIX. Use a combination of
str() and filter() to only display the data in
FINAL.VIOLENT.FIX for the location(s) that
FINAL.VIOLENT was missing latitude and longitude. Do this
in the second code chunk below.
Code and Output (6 Points):
#Answer (2 Points): ANSWER_HERE
Code and Output (2 Point):
#