Consider the dataset Wages1 from the Ecdat package.
## exper sex school wage
## 1 9 female 13 6.315296
## 2 12 female 12 5.479770
## 3 11 female 11 3.642170
## 4 9 female 14 4.593337
## 5 8 female 14 2.418157
## 6 9 female 14 2.094058
This observational dataset records the years of experience, the years of schooling, the sex, and the hourly wage for 3,294 workers. A Guide to Modern Econometrics by Marno Verbeek uses this data in a linear regression context. According to Verbeek, the data is a subsample from the US National Longitudinal Study.
The purpose of this tutorial is to practice the creative process in
exploratory data analysis of asking questions and then investigating
those questions using visuals and statistical summaries. It is your job
to apply your detective skills to the information hidden in this data.
For future use, utilize the modified dataset wage, created according to the R code below:
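library(Ecdat)      # provides the Wages1 dataset
library(tidyverse)  # provides as_tibble(), %>%, rename(), arrange(), and ggplot()
wage=as_tibble(Wages1) %>%
  rename(experience=exper) %>%
  arrange(school)
head(wage)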
## # A tibble: 6 × 4
## experience sex school wage
## <int> <fct> <int> <dbl>
## 1 18 male 3 5.52
## 2 15 male 4 3.56
## 3 18 male 4 9.10
## 4 10 female 5 0.603
## 5 11 male 5 3.80
## 6 14 male 5 7.50
Use geom_bar() to investigate the distribution of level of experience found in wage.
ggplot(DATA) +
  geom_bar(aes(x=VARIABLE))
Use group_by(experience) along with the pipe %>% to output the most common number of years of experience along with the number of occurrences found in the data. The most common value for years of experience is _____ and occurs _____ times.
wage %>%
  group_by(experience) %>%
  summarize(COMPLETE,.groups='drop') %>%
  arrange(COMPLETE) %>%
  summarize(common.exp=first(experience),common.n=first(n),.groups='drop')
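The count-then-sort pattern described above can be sketched on a small made-up tibble. This is only an illustration under stated assumptions: the object toy and its values are hypothetical and are not taken from Wages1, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, not taken from Wages1
toy <- tibble(experience = c(5, 7, 7, 8, 7, 5, 9))

most_common <- toy %>%
  group_by(experience) %>%
  summarize(n = n(), .groups = "drop") %>%  # one row per distinct value, with its count
  arrange(desc(n)) %>%                      # largest count first
  summarize(common.exp = first(experience), # value in the top row
            common.n = first(n))            # its number of occurrences

most_common
```

In this toy data the value 7 appears three times, so the result has common.exp equal to 7 and common.n equal to 3.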
Use geom_bar() to visualize the overall distribution of level of schooling found in the data.
ggplot(DATA) +
  geom_bar(aes(x=VARIABLE))
DATA %>%
group_by(VARIABLE) %>%
summarize(n=n()) %>%
arrange(desc(VARIABLE)) %>%
summarize(max.school=first(VARIABLE),
max.n=first(n))
Use geom_point() to display a scatter plot representing the relationship between these two discrete numeric variables. Consider using alpha=0.1 so that darker regions indicate where the observations are most concentrated and the relationship is represented the best.
ggplot(DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2),
             alpha=ALPHA_VALUE,shape=16,size=2)
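To see why transparency helps with discrete variables, the sketch below uses a made-up data frame in which one coordinate pair is repeated many times. The object names toy, overlap, and p are hypothetical, the values are not from Wages1, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Hypothetical toy data: the pair (2, 3) is repeated 20 times, (4, 1) appears once
toy <- data.frame(x = c(rep(2, 20), 4), y = c(rep(3, 20), 1))

# Number of observations stacked on each (x, y) pair that actually occurs
overlap <- as.data.frame(table(x = toy$x, y = toy$y))
overlap <- overlap[overlap$Freq > 0, ]
overlap

# With alpha = 0.1 each point is mostly transparent, so stacked points
# build up a darker mark than a single isolated point
p <- ggplot(toy) +
  geom_point(aes(x = x, y = y), alpha = 0.1, shape = 16, size = 2)
p
```

Without transparency both locations would be drawn as identical solid dots, hiding the fact that one of them holds 20 observations.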
The years of experience seem to _____ (increase/decrease) as the years of schooling increase. Is this what you expected to see? ____ (yes/no).
Practically, what reasons do you hypothesize for this observed relationship? (Discuss within your group. You do not have to provide answers in the submission.)
Use geom_freqpoly() to compare the distribution of wage for females to the distribution of wage for males. Where do these distributions look the same and where do they differ?
ggplot(DATA) +
  geom_TYPE(aes(x=VARIABLE1,color=VARIABLE2))
Use group_by() along with summarize() to report the mean wage, the standard error of wage, and a 95% confidence interval for the unknown population mean hourly wage for each level of sex. The standard error is equal to the standard deviation divided by the square root of the sample size. The 95% confidence interval is approximated by the interval whose lower and upper bounds lie 2 standard errors below and above the sample mean.
DATA %>%
  group_by(VARIABLE1) %>%
  summarize(n=n(),mean=mean(VARIABLE2),se=sd(VARIABLE2)/sqrt(n),
            lb=mean-2*se,ub=mean+2*se)
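The arithmetic behind the standard error and the approximate interval can be checked by hand on a few numbers. The sketch below uses base R only, and the vector x holds hypothetical wages chosen for easy arithmetic, not values from Wages1.

```r
# Hypothetical hourly wages for one group (illustrative values only, not from Wages1)
x <- c(4, 6, 5, 7, 8, 6)

n    <- length(x)        # sample size
xbar <- mean(x)          # sample mean
se   <- sd(x) / sqrt(n)  # standard error = standard deviation / sqrt(n)
lb   <- xbar - 2 * se    # lower bound of the approximate 95% interval
ub   <- xbar + 2 * se    # upper bound of the approximate 95% interval

round(c(n = n, mean = xbar, se = se, lb = lb, ub = ub), 3)
```

For these values the mean is 6 and the standard error is about 0.577, so the interval runs from roughly 4.85 to 7.15.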
Based on the confidence limits, do we have statistical evidence to say that the average hourly wage for men was larger than the average hourly wage for women? ______ (yes/no).
How would you explain your answer in terms of the confidence intervals that are constructed above? ____________________________
Use geom_point() along with the option color=sex to overlay scatter plots. Does there seem to be a clear distinction between female and male regarding this relationship? ______ (yes/no).
ggplot(data=DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2,color=VARIABLE3))
Replace x=experience with x=school in the previous plot. Does there seem to be a clear distinction between female and male regarding this relationship? ______ (yes/no).
ggplot(data=DATA) +
  geom_point(aes(x=VARIABLE1,y=VARIABLE2,color=VARIABLE3))
The graphic below summarizes the average hourly wage for the different combinations of schooling and experience level. The additional facet_grid(~sex) makes comparing the relationship of the three key numeric variables between the sexes quite easy.
wage %>%
group_by(experience,school,sex) %>%
summarize(n=n(),mean=mean(wage)) %>%
ungroup() %>%
ggplot() +
geom_tile(aes(x=experience,y=school,fill=mean)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~sex) + theme_dark()
## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.
The next figure is similar to the previous one except that the tile color reflects the standard deviation of wage rather than the mean. Combinations of experience, school, and sex containing 10 or fewer observations are omitted from this image.
wage %>%
group_by(experience,school,sex) %>%
summarize(n=n(),sd=sd(wage)) %>%
ungroup() %>%
filter(n>10) %>%
ggplot() +
geom_tile(aes(x=experience,y=school,fill=sd)) +
scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
facet_grid(~sex) + theme_dark()
## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.
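Both chunks above print a message because summarize() leaves the result grouped by experience and school. One possible way to avoid the message, sketched here on a made-up tibble, is to pass .groups="drop" to summarize(), which also makes the separate ungroup() step unnecessary. The object toy and its values are hypothetical, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, not taken from Wages1
toy <- tibble(
  experience = c(5, 5, 5, 6),
  school     = c(11, 11, 12, 12),
  sex        = c("female", "male", "male", "female"),
  wage       = c(4, 6, 5, 7)
)

cell_means <- toy %>%
  group_by(experience, school, sex) %>%
  summarize(n = n(), mean = mean(wage), .groups = "drop")  # returns an ungrouped tibble

is_grouped_df(cell_means)  # FALSE, so no ungroup() is needed afterwards
```

Keeping the explicit ungroup() call, as the tutorial does, gives the same ungrouped result; the difference is only whether the message is printed.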
Which plot is generally darker and what does that imply?
Specifically for the scenario where a worker has 5 years of experience and 11 years of schooling, what does the extreme contrast between female and male cells imply for this figure?