The Cleveland school district is made up of 20 high schools uniquely identified by numbers 1 to 20. Within each high school, a random sample of 20 students are selected and are uniquely identified by numbers 1 to 20. All 400 students selected are assessed on their mathematics skills based on district designed standardized tests in the years 2017 and 2018. The scores and corresponding percentiles of these selected students for both years are simulated in the following R code.
school.id=rep(1:20,each=20*2)
student.id=rep(rep(1:20,each=2),20)
type=rep(c("Score","Percentile"),20*20)
score2017=round(rnorm(20*20,mean=50,sd=10),0)
percentile2017=round(100*pnorm(score2017,mean=mean(score2017),sd=sd(score2017)),0)
score2018=round(rnorm(20*20,mean=75,sd=4),0)
percentile2018=round(100*pnorm(score2018,mean=mean(score2018),sd=sd(score2018)),0)
value2017=c(rbind(score2017,percentile2017))
value2018=c(rbind(score2018,percentile2018))
untidy.school = tibble(
school=school.id,
student=student.id,
type=type,
value2017=value2017,
value2018=value2018) %>%
filter(!(school==1 & student==4)) %>% filter(!(school==12 & student==18)) %>%
mutate(value2018=ifelse((school==1 & student==3)|(school==15 & student==18)|
(school==5 & student==12),NA,value2018))
The following table provides a preview of the first 10 rows of the simulated data.
head(untidy.school,10)
## # A tibble: 10 × 5
## school student type value2017 value2018
## <int> <int> <chr> <dbl> <dbl>
## 1 1 1 Score 41 70
## 2 1 1 Percentile 16 10
## 3 1 2 Score 30 76
## 4 1 2 Percentile 2 62
## 5 1 3 Score 48 NA
## 6 1 3 Percentile 39 NA
## 7 1 5 Score 49 78
## 8 1 5 Percentile 43 79
## 9 1 6 Score 35 68
## 10 1 6 Percentile 5 4
The data is not recorded data in a format that is immediately usable.
Using our understanding of the tidyr package, we can easily
convert this table into a form that is useful for data analysis.
The variable school uniquely identifies the school, but
the variable student only uniquely identifies the student
within the school. The problem is best illustrated by the
filter() function in dplyr.
untidy.school %>% filter(student==1) %>% head(4)
## # A tibble: 4 × 5
## school student type value2017 value2018
## <int> <int> <chr> <dbl> <dbl>
## 1 1 1 Score 41 70
## 2 1 1 Percentile 16 10
## 3 2 1 Score 60 71
## 4 2 1 Percentile 83 16
The subsetted table contains scores and percentiles for two
completely different children identified by student==1. We
need to create a unique identifier for each student in the Cleveland
school district. The unite() function can be utilized to
create a new variable called CID by concatenating the
identifiers for school and student. We want
CID to follow the general form SCHOOL.STUDENT.
Create a new tibble called untidy2.school that fixes this
problem without dropping the original variables school or
student. Read the documentation for unite()
either by searching on google or using ?unite to prevent
the loss of original variables in the creation of a new variable.
untidy2.school = untidy.school %>%
unite(CID,school,student,sep='.', remove=F)
glimpse(untidy2.school)
## Rows: 796
## Columns: 6
## $ CID <chr> "1.1", "1.1", "1.2", "1.2", "1.3", "1.3", "1.5", "1.5", "1.6…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ student <int> 1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, …
## $ type <chr> "Score", "Percentile", "Score", "Percentile", "Score", "Perc…
## $ value2017 <dbl> 41, 16, 30, 2, 48, 39, 49, 43, 35, 5, 58, 78, 57, 74, 57, 74…
## $ value2018 <dbl> 70, 10, 76, 62, NA, NA, 78, 79, 68, 4, 74, 41, 75, 52, 76, 6…
The variables value2017 and value2018
contain the scores and percentiles for two different years. In a new
tibble called untidy3.school, based on
untidy2.school, we want to create a new variable called
Year and a new variable called Value that
display the year and the result from that year, respectively. The
variable Year should be a numeric vector containing either
2017 or 2018. The most efficient way to modify the
data in this manner is to start by renaming value2017 and
value2018 to nonsynctactic names 2017 and
2018. Remember that you need to surround nonsyncactic names
with backticks to achieve this result.
untidy3.school = untidy2.school %>%
rename('2017'=value2017,'2018'=value2018) %>%
gather('2017':'2018',key="Year",value="Value",convert=T)
#pivot_longer('2017':'2018',names_to='Year',values_to='Value') %>%
#mutate(Year=as.integer(Year))
glimpse(untidy3.school)
## Rows: 1,592
## Columns: 6
## $ CID <chr> "1.1", "1.1", "1.2", "1.2", "1.3", "1.3", "1.5", "1.5", "1.6",…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ student <int> 1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11…
## $ type <chr> "Score", "Percentile", "Score", "Percentile", "Score", "Percen…
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
## $ Value <dbl> 41, 16, 30, 2, 48, 39, 49, 43, 35, 5, 58, 78, 57, 74, 57, 74, …
The variable type in untidy3.school
indicates that two completely different variables are contained in the
recently created variable called Value. Both the scores and
percentiles of students are contained in Value. Using the
function spread() we can create two new variables,
Score and Percentile, that display the
information contained in Value in separate columns. Using
untidy3.school, create a new tibble called
tidy.school that accomplishes these tasks.
tidy.school = untidy3.school %>%
pivot_wider(names_from=type,values_from=Value)
# spread(key=type,value=Value)
glimpse(tidy.school)
## Rows: 796
## Columns: 6
## $ CID <chr> "1.1", "1.2", "1.3", "1.5", "1.6", "1.7", "1.8", "1.9", "1.…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ student <int> 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
## $ Score <dbl> 41, 30, 48, 49, 35, 58, 57, 57, 55, 54, 54, 40, 62, 37, 41,…
## $ Percentile <dbl> 16, 2, 39, 43, 5, 78, 74, 74, 67, 63, 63, 13, 88, 8, 16, 81…
The original data contains explicitly missing and implicitly missing
values. Instances of both can be visibly seen in the first ten
observations. Below is a table showing the first 10 observations in the
cleaned dataset we called tidy.school. To appropriately,
view this we have to sort our observations by school and
student as seen in the original dataset
untidy.school.
## # A tibble: 10 × 6
## CID school student Year Score Percentile
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 1.1 1 1 2017 41 16
## 2 1.1 1 1 2018 70 10
## 3 1.2 1 2 2017 30 2
## 4 1.2 1 2 2018 76 62
## 5 1.3 1 3 2017 48 39
## 6 1.3 1 3 2018 NA NA
## 7 1.5 1 5 2017 49 43
## 8 1.5 1 5 2018 78 79
## 9 1.6 1 6 2017 35 5
## 10 1.6 1 6 2018 68 4
Based on the table above, you can see that student 3 from school 1 has a missing score and percentile for the year 2018. This is an example of explicitly missing information.
Based on the table above, you can see that student 4 from school 1 is clearly missing scores and percentiles from both years 2017 and 2018. This is an example of implicitly missing information.
Use the complete() function to convert all implicitly
missing to explicitly missing. Create a new table called
tidy2.school that reports missing values as NA
for all combinations of school, student, and year.
tidy2.school=tidy.school %>%
complete(school, student, Year)
The first 10 rows of tidy2.school are displayed
below.
tab.tidy2.school = tidy2.school %>%
head(10)
tab.tidy2.school
If you inspect the first 10 rows of tidy2.school, you
should see that the variable CID is missing for student
4 from school 1 even though we know that this students
unique district ID should be “1.4”. Using the pipe
%>%, combine all previous statements in an order where
this will not occur. Create a tibble named
final.tidy.school using a chain of commands that begins
with calling the original tibble untidy.school
final.tidy.school = untidy.school %>%
rename('2017'=value2017,'2018'=value2018) %>%
gather('2017':'2018',key="Year",value="Value",convert=T) %>%
pivot_wider(names_from=type,values_from=Value) %>%
complete(school, student, Year) %>%
unite(CID,school,student,sep='.', remove=F)
The figure below uses boxplots to show the distribution of scores in the 20 schools for the years 2017 and 2018. How would you interpret it?
ggplot(final.tidy.school) +
geom_boxplot(aes(x=as.factor(Year),y=Score,fill=as.factor(school))) +
guides(fill=F)+
theme_minimal()
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Using different colors for each student, the next two pictures show the change in test scores and percentiles for all students (without missing values) sampled from the district. Both of these pictures are necessary in understanding the improvement in mathematical knowledge on the student level. As you can see, they are very different from each other. Hypothesize a reason that would have caused this phenomenon to occur.
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Score,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Percentile,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).