Introduction

Consider the dataset Wages1 from the Ecdat package.

##   exper    sex school     wage
## 1     9 female     13 6.315296
## 2    12 female     12 5.479770
## 3    11 female     11 3.642170
## 4     9 female     14 4.593337
## 5     8 female     14 2.418157
## 6     9 female     14 2.094058

This observational dataset records years of experience, years of schooling, sex, and hourly wage for 3,294 workers. Marno Verbeek's A Guide to Modern Econometrics uses these data in a linear regression context. According to Verbeek, the data are a subsample from the US National Longitudinal Study.

The purpose of this tutorial is to practice the creative process of exploratory data analysis: asking questions and then investigating them using visuals and statistical summaries. It is your job to apply your detective skills to the information hidden in this data. For future use, create the modified dataset wage using the R code below:

library(tidyverse)  # tibble, dplyr, and ggplot2
library(Ecdat)      # provides Wages1

wage=as_tibble(Wages1) %>%
  rename(experience=exper) %>%
  arrange(school)

head(wage)

## # A tibble: 6 × 4
##   experience sex    school  wage
##        <int> <fct>   <int> <dbl>
## 1         18 male        3 5.52 
## 2         15 male        4 3.56 
## 3         18 male        4 9.10 
## 4         10 female      5 0.603
## 5         11 male        5 3.80 
## 6         14 male        5 7.50

Part 1: Questions About Variation

Question 1: What is the most common number of years of experience found in the data?

First, use geom_bar() to investigate the distribution of years of experience in wage.

ggplot(wage) +
  geom_bar(aes(x=experience))

Use group_by(experience) along with the pipe %>% to output the most common number of years of experience along with its number of occurrences in the data. The most common value for years of experience is 9, which occurs 654 times.

wage %>%
  group_by(experience) %>%
  summarize(n=n()) %>%
  arrange(desc(n)) %>%
  summarize(common.exp=first(experience),common.n=first(n))

## # A tibble: 1 × 2
##   common.exp common.n
##        <int>    <int>
## 1          9      654
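The group-and-count steps above can also be written more compactly. The sketch below is one possible shorthand, not the tutorial's required solution: most_common() is a hypothetical helper name, and the toy tibble is made-up data used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical helper (not part of the tutorial): returns the most frequent
# value of `var` in `df` together with its count.
most_common <- function(df, var) {
  df %>%
    count({{ var }}, sort = TRUE) %>%  # one row per value, largest n first
    slice_head(n = 1)                  # keep only the top row
}

# Toy data for illustration only (not taken from Wages1).
toy <- tibble(experience = c(9, 9, 9, 8, 8, 7))
most_common(toy, experience)
```

With the tutorial's data, the equivalent call would be most_common(wage, experience).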

Question 2: What is the maximum number for years of schooling found in the data?

First, use geom_bar() to visualize the overall distribution of level of schooling found in the data.

ggplot(wage) +
  geom_bar(aes(x=school))

Next, modify the code in Question 1 to display the maximum level of schooling and the number of workers in the data with that many years of schooling. The maximum number of years in school was 16, which occurred 16 times in our sample.

wage %>%
  group_by(school) %>%
  summarize(n=n()) %>%
  arrange(desc(school)) %>% 
  summarize(max.school=first(school),
            max.n=first(n))

## # A tibble: 1 × 2
##   max.school max.n
##        <int> <int>
## 1         16    16

Part 2: Questions about Covariation

Question 3: Is there a relationship between level of schooling and level of experience?

Use geom_point() to display a scatter plot of the relationship between these two discrete numeric variables. Because many workers share the same (school, experience) pair, points overlap; consider using alpha=0.1 so that darker regions show where observations are concentrated.

ggplot(wage) +
  geom_point(aes(x=school,y=experience),
             alpha=0.1,shape=16,size=2)

The years of experience seem to decrease (increase/decrease) as the years of schooling increase. Is this what you expected to see? Yes (yes/no).
Practically, what reasons do you hypothesize for this observed relationship? (Discuss within the group. You do not have to provide answers in the submission.) In general, people who leave school earlier start working sooner and so accumulate more experience, while people who stay in school longer enter the workforce later.
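A single number can complement the visual impression from the scatter plot. The sketch below uses Spearman's rank correlation, which is one reasonable choice for discrete variables and is not prescribed by the tutorial. The toy data frame is made up for illustration only.

```r
# Toy data for illustration only (not taken from Wages1).
toy <- data.frame(school     = c(8, 10, 12, 14, 16),
                  experience = c(12, 10, 8, 6, 4))

# Spearman's rank correlation: a negative value means experience tends to
# fall as schooling rises; a positive value means the opposite.
rho <- cor(toy$school, toy$experience, method = "spearman")
rho
```

With the tutorial's data, the corresponding call would be cor(wage$school, wage$experience, method = "spearman").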

Question 4: How do hourly wages differ between males and females?

Use geom_freqpoly() to compare the distribution of wage for females to the distribution of wage for males. Where do these distributions look the same, and where do they differ?

ggplot(wage)+
  geom_freqpoly(aes(x=wage,color=sex))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
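The message above suggests choosing a bin width explicitly. The sketch below shows one way to do that, and also plots density instead of raw counts so that groups of different sizes are easier to compare. The binwidth of 1 (one dollar per hour) is an assumed value for illustration, and the toy data frame is made up.

```r
library(ggplot2)

# Toy data for illustration only (not taken from Wages1).
toy <- data.frame(
  wage = c(2.5, 3.1, 4.8, 5.2, 6.0, 3.9, 5.5, 6.7, 7.4, 9.1),
  sex  = factor(rep(c("female", "male"), each = 5))
)

# An explicit binwidth avoids the default of 30 bins; after_stat(density)
# rescales each group's curve so that it integrates to 1.
p <- ggplot(toy) +
  geom_freqpoly(aes(x = wage, y = after_stat(density), color = sex),
                binwidth = 1)
p
```

With the tutorial's data, replace toy with wage and try a few binwidth values to see which reveals the shape of the distributions best.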

Use group_by() along with summarize() to report the mean wage, the standard error of wage, and a 95% confidence interval for the unknown population mean hourly wage for each level of sex. The standard error is equal to the standard deviation divided by the square root of the sample size. The 95% confidence interval is approximated by the interval extending 2 standard errors below and above the sample mean.

wage %>% 
  group_by(sex) %>%
  summarize(n=n(),mean=mean(wage),se=sd(wage)/sqrt(n),
            lb=mean-2*se,ub=mean+2*se)

## # A tibble: 2 × 6
##   sex        n  mean     se    lb    ub
##   <fct>  <int> <dbl>  <dbl> <dbl> <dbl>
## 1 female  1569  5.15 0.0726  5.00  5.29
## 2 male    1725  6.31 0.0842  6.14  6.48
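The multiplier 2 used above is a rounded approximation. A minimal sketch of the interval based on the t distribution follows; mean_ci() is a hypothetical helper name, and the vector x is toy data. For samples as large as those here (more than 1,500 per group), the t multiplier is about 1.96, so the two approaches give nearly the same bounds.

```r
# Hypothetical helper (not part of the tutorial): t-based confidence
# interval for a population mean.
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                       # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)   # t multiplier replacing "2"
  c(lb = mean(x) - crit * se, ub = mean(x) + crit * se)
}

# Toy data for illustration only.
x  <- c(1, 2, 3, 4, 5)
ci <- mean_ci(x)
ci
```

The result agrees with the interval reported by t.test(x)$conf.int, which is a convenient way to check the helper.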

Based on the confidence limits, do we have statistical evidence to say that the average hourly wage for men was larger than the average hourly wage for women? Yes (yes/no).
How would you explain your answer in terms of the confidence intervals constructed above? The lower bound (lb) of the interval for the male average wage, 6.14, is greater than the upper bound (ub) of the interval for the female average wage, 5.29, so the two intervals do not overlap.

Question 5: Does the relationship between hourly wage and years of experience differ between the sexes?

Use geom_point() along with the option color=sex to overlay scatter plots. Does there seem to be a clear distinction between female and male regarding this relationship? yes (yes/no).

ggplot(data=wage) +
  geom_point(aes(x=experience,y=wage,color=sex))

Question 6: Does the relationship between hourly wage and years of schooling differ between the sexes?

Repeat the graphic created in Question 5, replacing x=experience with x=school. Does there seem to be a clear distinction between female and male regarding this relationship? no (yes/no).

ggplot(data=wage) +
  geom_point(aes(x=school,y=wage,color=sex))

Question 7: What is the relationship between hourly wage and the interaction between the years of experience and years of schooling?

The graphic below summarizes the average hourly wage for the different combinations of schooling and experience level. The additional facet_grid(~sex) makes comparing the relationship of the three key numeric variables between the sexes quite easy.

wage %>%
  group_by(experience,school,sex) %>%
  summarize(n=n(),mean=mean(wage)) %>%
  ungroup() %>%
  ggplot() +
  geom_tile(aes(x=experience,y=school,fill=mean)) +
  scale_fill_gradientn(colors=c("black","lightskyblue","white")) +
  facet_grid(~sex) + theme_dark()

## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.
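The message above appears because summarize() drops only the last grouping variable by default. One way to silence it, sketched here on made-up toy data, is to pass .groups = "drop", which returns an ungrouped tibble and makes a separate ungroup() step unnecessary.

```r
library(dplyr)

# Toy data for illustration only (not taken from Wages1).
toy <- tibble(
  experience = c(5, 5, 5, 5, 6, 6),
  school     = c(11, 11, 11, 11, 12, 12),
  sex        = c("female", "female", "male", "male", "female", "male"),
  wage       = c(4.0, 4.2, 3.0, 9.0, 5.0, 6.0)
)

# .groups = "drop" removes all grouping from the result, so no message is
# printed and no ungroup() call is needed afterwards.
cell_means <- toy %>%
  group_by(experience, school, sex) %>%
  summarize(n = n(), mean = mean(wage), .groups = "drop")
cell_means
```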

What are some differences between the sexes regarding this relationship that are apparent in this chart?

For the same combination of experience and schooling, males tend to earn more.

The next figure is similar to the previous one except that the tile color reflects the standard deviation of wage rather than the mean. Combinations of experience, schooling, and sex with 10 or fewer observations are excluded from this image.

wage %>%
  group_by(experience,school,sex) %>%
  summarize(n=n(),sd=sd(wage)) %>%
  ungroup() %>%
  filter(n>10) %>%
  ggplot() +
  geom_tile(aes(x=experience,y=school,fill=sd)) +
  scale_fill_gradientn(colors=c("black","lightskyblue","white"))+
  facet_grid(~sex) + theme_dark()

## `summarise()` has grouped output by 'experience', 'school'. You can override
## using the `.groups` argument.

Which plot is generally darker and what does that imply?

The left one (female) is darker. For most combinations of experience and schooling, the standard deviation of wage tends to be smaller for females, meaning their wages are less spread out.

Specifically for the scenario where a worker has 5 years of experience and 11 years of schooling, what does the extreme contrast between female and male cells imply for this figure?

Women with 5 years of experience and 11 years of schooling earn very similar wages. However, wages for men with 5 years of experience and 11 years of schooling vary widely.
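A reading taken from tile colors can be checked by pulling out the numbers behind a single cell. The sketch below is one way to do so; cell_spread() is a hypothetical helper name, and the toy tibble is made-up data that only mimics the pattern described above.

```r
library(dplyr)

# Hypothetical helper (not part of the tutorial): summarizes the spread of
# wage by sex for one combination of experience and schooling.
cell_spread <- function(df, exp_years, school_years) {
  df %>%
    filter(experience == exp_years, school == school_years) %>%
    group_by(sex) %>%
    summarize(n = n(), sd = sd(wage), min = min(wage), max = max(wage))
}

# Toy data for illustration only (not taken from Wages1).
toy <- tibble(
  experience = c(5, 5, 5, 5),
  school     = c(11, 11, 11, 11),
  sex        = c("female", "female", "male", "male"),
  wage       = c(4.0, 4.2, 3.0, 9.0)
)
cell_spread(toy, 5, 11)
```

With the tutorial's data, the corresponding call would be cell_spread(wage, 5, 11).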

STOR 320 Tutorial on Exploratory Data Analysis

September 24, 2024