Consider the dataset heart_disease from the funModeling package.
##   age gender chest_pain resting_blood_pressure serum_cholestoral
## 1  63   male          1                    145               233
## 2  67   male          4                    160               286
## 3  67   male          4                    120               229
## 4  37   male          3                    130               250
## 5  41 female          2                    130               204
##   fasting_blood_sugar resting_electro max_heart_rate exer_angina oldpeak slope
## 1                   1               2            150           0     2.3     3
## 2                   0               2            108           1     1.5     2
## 3                   0               2            129           1     2.6     2
## 4                   0               0            187           0     3.5     3
## 5                   0               2            172           0     1.4     1
##   num_vessels_flour thal heart_disease_severity exter_angina has_heart_disease
## 1                 0    6                      0            0                no
## 2                 3    3                      2            1               yes
## 3                 2    7                      1            1               yes
## 4                 0    3                      0            0                no
## 5                 0    3                      0            0                no
The variables relate to patients in a clinical trial. heart_disease is a data frame with 303 rows and 16 variables. We’ll focus on the following variables in the analysis:
age: age in years (numerical)
max_heart_rate: maximum heart rate, in beats per minute (numerical)
thal: a blood disorder called thalassemia (categorical: 3 = normal; 6 = fixed defect; 7 = reversible defect)
has_heart_disease: heart disease (categorical: no, yes)
gender: gender of patient (categorical: male, female)
The purpose of this lab is to practice the creative process in
exploratory data analysis of asking questions and then investigating
those questions using visuals and statistical summaries. It is your job
to apply your detective skills to the information hidden in this data.
For the rest of the lab, use the reduced dataset heart, created with the R code below:
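library(tidyverse)    # as_tibble(), select(), %>% and ggplot2
library(funModeling)  # provides the heart_disease data frame
heart <- as_tibble(heart_disease) %>%
  select(age, max_heart_rate, thal, has_heart_disease, gender)
head(heart)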
## # A tibble: 6 × 5
##     age max_heart_rate thal  has_heart_disease gender
##   <int>          <int> <fct> <fct>             <fct>
## 1    63            150 6     no                male
## 2    67            108 3     yes               male
## 3    67            129 7     yes               male
## 4    37            187 3     no                male
## 5    41            172 3     no                female
## 6    56            178 3     no                male
When you get the desired result for each step, change Eval=F to Eval=T and knit the document to HTML to make sure it works. After you complete the lab, submit the HTML file of your completed work to Canvas before the deadline.
Use geom_histogram() to investigate the distribution of age found in heart.
ggplot(DATA) +
  geom_histogram(aes(x=VARIABLE))
Use group_by(age) along with the pipe %>% to output the most common age along with the number of patients of that age. The most common value for age is _____ and the number of patients of that age is _____.
heart %>%
  group_by(age) %>%
  summarise(n=n(),.groups='drop') %>%
  arrange(desc(n)) %>%
  summarise(common.exp=first(age),common.n=first(n),.groups='drop')
## # A tibble: 1 × 2
##   common.exp common.n
##        <int>    <int>
## 1         58       19
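As a side note, the group, count, and sort pattern above can also be written with dplyr's count(). The sketch below compares the two forms on a small made-up tibble; the name toy and its values are hypothetical and are not taken from the lab data.

```r
library(dplyr)

# Hypothetical toy data, not the lab dataset
toy <- tibble(age = c(50, 61, 61, 58, 61, 58))

# Long form: group, count rows per group, sort by count, keep the top row
long_form <- toy %>%
  group_by(age) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n)) %>%
  slice(1)

# Short form: count() groups and tallies in one step; sort = TRUE orders by n
short_form <- toy %>%
  count(age, sort = TRUE) %>%
  slice(1)

long_form
short_form
```

Both forms keep only one row, so a tie for the most common value would not be visible in the result; printing the full sorted table is one way to check for ties.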
Use geom_density() to visualize the overall distribution of max heart rate.
ggplot(DATA) +
  geom_TYPE(aes(x=VARIABLE))
Report the maximum max_heart_rate and the number of patients in the data that had that max heart rate. The maximum max_heart_rate was ____, which occurred ____ times in our sample.
DATA %>%
  group_by(VARIABLE) %>%
  summarise(n=n(),.groups='drop') %>%
  arrange(desc(VARIABLE)) %>%
  summarise(max.max_heart_rate=first(VARIABLE),
            max.n=first(n),.groups='drop')
Use geom_point() to display a scatter plot representing the relationship between age and max_heart_rate, two numeric variables. Use geom_smooth() to display a linear regression line to show the relationship between them.
ggplot(DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2),
             alpha=ALPHA_VALUE,shape=16,size=2) +
  geom_smooth(aes(x=VARIABLE1,y=VARIABLE2),method=METHOD)
The max heart rate seems to _____ (increase/decrease) as the age of patients increases. Is this what you expected to see? ____ (yes/no).
Practically, what reasons do you hypothesize for this observed relationship?
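To illustrate how a scatter plot and a fitted straight line work together, here is a minimal sketch on simulated data. The data frame toy, its columns x and y, and the simulated trend are all assumptions made for illustration; method = "lm" is ggplot2's option for a least-squares line.

```r
library(ggplot2)

# Hypothetical simulated data with a roughly linear downward trend
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 100 - 0.8 * toy$x + rnorm(50, sd = 5)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5, shape = 16, size = 2) +  # semi-transparent points reduce overplotting
  geom_smooth(method = "lm")                       # least-squares line with a confidence band

p

# The slope of the same line can be read from lm() directly
fit <- lm(y ~ x, data = toy)
coef(fit)
```

A negative slope coefficient corresponds to a line that falls from left to right, which is a quick numerical check on what the plot shows.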
Use geom_boxplot() to compare the distribution of max heart rate of patients who have heart disease to the distribution of max heart rate of patients who do not have heart disease.
ggplot(DATA) +
  geom_TYPE(aes(x=VARIABLE1,y=VARIABLE2))
Use group_by() along with summarise() to report the mean of max_heart_rate, the standard error of max_heart_rate, and a 95% confidence interval for the unknown population mean of max_heart_rate for each level of has_heart_disease. The standard error is equal to the standard deviation divided by the square root of the sample size. The 95% confidence interval is approximated by the interval extending 2 standard errors below and above the sample mean.
DATA %>%
  group_by(VARIABLE1) %>%
  summarise(n=n(),mean=mean(VARIABLE2),se=sd(VARIABLE2)/sqrt(n),
            lb=mean-2*se,ub=mean+2*se,.groups='drop')
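The standard error and interval arithmetic described above can be checked by hand on a single vector. The sketch below uses a made-up sample; the vector x and its values are hypothetical and are not taken from the lab data.

```r
# Hypothetical sample, not taken from the lab data
x <- c(150, 162, 171, 145, 158, 166, 149, 173, 160)

n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)   # standard error = standard deviation / sqrt(sample size)

lb <- xbar - 2 * se       # lower bound of the approximate 95% interval
ub <- xbar + 2 * se       # upper bound

c(n = n, mean = xbar, se = se, lb = lb, ub = ub)

# The multiplier 2 is a rounded version of the normal quantile
qnorm(0.975)
```

The interval is symmetric about the sample mean and has total width of 4 standard errors, so larger samples (smaller standard errors) give narrower intervals.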
Based on the confidence limits, do we have statistical evidence to say that the average max_heart_rate for patients who do not have heart disease was larger than the average max_heart_rate for patients who have heart disease? _____ (yes/no).
How would you explain your answer in terms of the confidence intervals that are constructed above?______________
Use geom_point() along with the option color=has_heart_disease to overlay scatter plots. Does there seem to be a clear distinction between the groups that have and do not have heart disease regarding this relationship? ____ (yes/no).
ggplot(data=DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2,color=VARIABLE3))
Replace color=has_heart_disease with color=gender. Does there seem to be a clear distinction between female and male patients regarding this relationship? ____ (yes/no).
ggplot(data=DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2,color=VARIABLE3))
Generate a heatmap to summarize the average max heart rate for the different combinations of gender and thal. Use facet_grid(~has_heart_disease) so that the relationship among the three variables is easy to compare between patients who have and do not have heart disease.
na.omit(heart) %>%
group_by(VARIABLE1,VARIABLE2,VARIABLE3) %>%
summarise(n=n(),mean=mean(VARIABLE4),.groups='drop') %>%
ggplot() +
geom_tile(aes(x=VARIABLE1,y=VARIABLE2,fill=mean)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~VARIABLE3) + theme_dark()
ANSWER:__________________
The next figure is similar to the previous one except that the tile color reflects the standard deviation of max heart rate rather than the mean. Combinations of gender and thal with 10 or fewer observations are omitted from this image.
na.omit(heart) %>%
group_by(VARIABLE1,VARIABLE2,VARIABLE3) %>%
summarise(n=n(),sd=sd(VARIABLE4),.groups='drop') %>%
filter(n>10) %>%
ggplot() +
geom_tile(aes(x=VARIABLE1,y=VARIABLE2,fill=FILL)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~VARIABLE3) + theme_dark()
ANSWER:__________________
ANSWER:__________________
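One reason for the n > 10 filter in the standard deviation figure is that the standard deviation of a very small group is unreliable, and for a single observation it is undefined. The sketch below shows this on a made-up tibble; the names toy, group, and value are hypothetical, and the threshold n > 1 is used only because the toy data are tiny.

```r
library(dplyr)

# Hypothetical toy data: group "a" has three rows, group "b" has one
toy <- tibble(
  group = c("a", "a", "a", "b"),
  value = c(10, 12, 14, 20)
)

toy_sd <- toy %>%
  group_by(group) %>%
  summarise(n = n(), sd = sd(value), .groups = "drop")

toy_sd                                 # group "b" has n = 1, so its sd is NA

kept <- toy_sd %>% filter(n > 1)       # dropping small groups removes the undefined value
kept
```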