Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.
Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.
Submission: Submit via Canvas. Must be submitted as an HTML file generated in RStudio.
Does high cholesterol lead to high crime rates? Probably not. This
data analysis assignment is separated into two parts which cover
material from the lectures on tidy data and joins. In Part 1, you will
demonstrate the basic concept of joins by connecting relational data
involving a cholesterol study. For this segment,
pivot_longer
and pivot_wider
will be utilized
to create a single tidy dataset ready for analysis. In Part 2, we will
join 5 datasets.
# Data for Part 1
CHOL1=read_csv("Cholesterol.csv")
CHOL2=read_csv("Cholesterol2.csv")
# Data for Part 2
VIOLENT=read_csv("FINAL_VIOLENT.csv")
ZIP=read_csv("FINAL_ZIP.csv")
STATE_ABBREV=read_csv("FINAL_STATE_ABBREV.csv")
CENSUS=read_csv("FINAL_CENSUS.csv")
S_VS_D=read_csv("FINAL_SAFE_VS_DANGEROUS.CSV")
The data frame CHOL1
contains experimental results from
randomly assigning 18 people to one of two competing margarine
brands “A” and “B”. The cholesterol of these patients was measured once
before using the margarine brand, once after 4 weeks with the
margarine brand, and then again after 8 weeks with the
margarine brand. Researchers want to see if there is benefit of these
brands of margarine on reducing an individual’s cholesterol and want to
determine if there is a statistically significant difference between the
two competing brands.
Start by examing the tables CHOL1
and CHOL2
and answering the following questions with Yes or No
responses.
Is the variable ID
in CHOL1
a primary
key?
Answer (1 Point): ANSWER_HERE
Is the variable, Margarine
in CHOL1
a
primary key?
Answer (1 Point): ANSWER_HERE
Is the variable, Brand
in CHOL2
a primary
key?
Answer (1 Point): ANSWER_HERE
In a new data frame called CHOL1a
based on
CHOL1
, rename the variables After4weeks
and
After8weeks
to nonsynctactic variable names 4
and 8
, respectively. Use names(CHOL1a)
to
display this modification.
#
Use the pivot_longer()
function or gather()
function on CHOL1a
to create a new numeric variable called
Week
that contains numeric values 4 or 8
and a new numeric variable called Response
that contains
the Cholesterol after the corresponding number of weeks. Create a new
data frame called CHOL1b
with these modifications and use
str(Chol1b)
to show that both variables have been created
correctly and are indeed numeric (an integer variable is a specific type
of numeric variable).
#
Now working with CHOL2
, we want to spread the variable
Statistic
across multiple columns. Do this in a new data
frame called CHOL2a
and use print(CHOL2a)
to
display the modified complete table.
#
Start by examing the tables CHOL1b
and
CHOL2a
and answering the following questions with
Yes or No responses.
Is the variable ID
in CHOL1b
a primary
key?
Answer (1 Point): ANSWER_HERE
Is the variable, Margarine
in CHOL1b
a
primary key?
Answer (1 Point): ANSWER_HERE
Is the variable, Brand
in CHOL2a
a primary
key?
Answer (1 Point): ANSWER_HERE
Get the nutritional facts of the different margarine brands in
CHOL2a
into the experimental results found in
CHOL1b
using a join. Create a new data frame named
CHOL.COMBINED
and display the table using
head(CHOL.COMBINED)
. This final data frame should contain
36 observations and 10 variables.
#
In the zipped folder, there are 5 CSV files. In this
section, we are going to merge all of that data into one object called
FINAL.VIOLENT
.
The dataset S_VS_D
contains a variable
CLASS
where “S=Safe” and “D=Dangerous” according to the
article These
Are the 2018 Safest and Most Dangerous States in the U.S by Steve
Karantzoulidis. We seek to compare the violent crime statistics for
states not in this list. Use a filtering join to create a new data frame
called VIOLENT2
that only contains violent crime statistics
from the states not represented in the data frame S_VS_D
.
Use str(VIOLENT2)
to display the variables and the
dimensions of VIOLENT2
.
#
Start by creating a new data set called VIOLENT3
based
on VIOLENT2
that fixes some problems in the variable
City
. Specifically, we would like to change “Louisville
Metro” to “Louisville”. Next, create a new data frame named
VIOLENT4
that connects the population change and density
measures from 2019 contained in CENSUS
to the cities and
states in VIOLENT3
. Use head(VIOLENT4)
to give
a preview of the new merged dataset.
Finally, in a complete sentence, identify any location(s) (Cities and States) missing violent crime information.
Code and Output (4 Points):
#
Answer (4 Points): ANSWER_HERE
Either ambitiously using one step or less-ambitiously using multiple
steps add the longitude and latitude information provided in
ZIP
to the cities and states in VIOLENT4
. You
will need to use STATE_ABBREV
data to link these two data
frames. Your final data frame named FINAL.VIOLENT
should
contain all of the information in VIOLENT4
along with the
variables lat
and lon
from ZIP
.
There should be no state abbreviations in
FINAL.VIOLENT
since this information is redundant. Use
str(FINAL.VIOLENT)
to demonstrate that everything worked as
planned.
In FINAL.VIOLENT
identify what cities are missing
latitude and longitude. Closely, inspect both the ZIP
and
VIOLENT4
data frames. Report the location(s) missing
geographical information and explain in complete sentences why this
happened.
Finally, challenge yourself and attempt to fix this problem in a new
data frame called FINAL.VIOLENT.FIX
. Use a combination of
str()
and filter()
to only display the data in
FINAL.VIOLENT.FIX
for the location(s) that
FINAL.VIOLENT
was missing latitude and longitude. Do this
in the second code chunk below.
Code and Output (6 Points):
#
Answer (2 Points): ANSWER_HERE
Code and Output (2 Point):
#