The Cleveland school district is made up of 20 high schools uniquely identified by numbers 1 to 20. Within each high school, a random sample of 20 students are selected and are uniquely identified by numbers 1 to 20. All 400 students selected are assessed on their mathematics skills based on district designed standardized tests in the years 2017 and 2018. The scores and corresponding percentiles of these selected students for both years are simulated in the following R code.
school.id=rep(1:20,each=20*2)
student.id=rep(rep(1:20,each=2),20)
type=rep(c("Score","Percentile"),20*20)
score2017=round(rnorm(20*20,mean=50,sd=10),0)
percentile2017=round(100*pnorm(score2017,mean=mean(score2017),sd=sd(score2017)),0)
score2018=round(rnorm(20*20,mean=75,sd=4),0)
percentile2018=round(100*pnorm(score2018,mean=mean(score2018),sd=sd(score2018)),0)
value2017=c(rbind(score2017,percentile2017))
value2018=c(rbind(score2018,percentile2018))
untidy.school = tibble(
school=school.id,
student=student.id,
type=type,
value2017=value2017,
value2018=value2018) %>%
filter(!(school==1 & student==4)) %>% filter(!(school==12 & student==18)) %>%
mutate(value2018=ifelse((school==1 & student==3)|(school==15 & student==18)|
(school==5 & student==12),NA,value2018))
Below is an HTML table generated using the xtable
package in R. For more information regarding this package, see the xtable gallery. The R code in the code chunk converts an R data frame object to an HTML table. HTML table attributes can be specified within the function print()
. The code chunk option echo=F
prevents the code from showing and the option results="asis"
ensures that the resulting HTML table is displayed when knitted to HTML. The table provides a preview of the first 10 rows of the simulated data.
school | student | type | value2017 | value2018 |
---|---|---|---|---|
1 | 1 | Score | 57 | 73 |
1 | 1 | Percentile | 74 | 32 |
1 | 2 | Score | 42 | 79 |
1 | 2 | Percentile | 19 | 84 |
1 | 3 | Score | 44 | |
1 | 3 | Percentile | 25 | |
1 | 5 | Score | 45 | 71 |
1 | 5 | Percentile | 29 | 17 |
1 | 6 | Score | 47 | 75 |
1 | 6 | Percentile | 36 | 51 |
The data is not recorded data in a format that is immediately usable. Using our understanding of the tidyr
package, we can easily convert this table into a form that is useful for data analysis.
The variable school
uniquely identifies the school, but the variable student
only uniquely identifies the student within the school. The problem is best illustrated by the filter()
function in dplyr
.
untidy.school %>% filter(student==1) %>% head(4)
## # A tibble: 4 × 5
## school student type value2017 value2018
## <int> <int> <chr> <dbl> <dbl>
## 1 1 1 Score 57 73
## 2 1 1 Percentile 74 32
## 3 2 1 Score 63 74
## 4 2 1 Percentile 90 41
The subsetted table contains scores and percentiles for two completely different children identified by student==1
. We need to create a unique identifier for each student in the Cleveland school district. The unite()
function can be utilized to create a new variable called CID
by concatenating the identifiers for school
and student
. We want CID
to follow the general form SCHOOL.STUDENT. Create a new tibble called untidy2.school
that fixes this problem without dropping the original variables school
or student
. Read the documentation for unite()
either by searching on google or using ?unite
to prevent the loss of original variables in the creation of a new variable.
untidy2.school = untidy.school %>%
unite(CID,school,student,sep='.',remove=F)
glimpse(untidy2.school)
## Rows: 796
## Columns: 6
## $ CID <chr> "1.1", "1.1", "1.2", "1.2", "1.3", "1.3", "1.5", "1.5", "1.6…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ student <int> 1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, …
## $ type <chr> "Score", "Percentile", "Score", "Percentile", "Score", "Perc…
## $ value2017 <dbl> 57, 74, 42, 19, 44, 25, 45, 29, 47, 36, 48, 40, 54, 64, 64, …
## $ value2018 <dbl> 73, 32, 79, 84, NA, NA, 71, 17, 75, 51, 79, 84, 72, 24, 75, …
The variables value2017
and value2018
contain the scores and percentiles for two different years. In a new tibble called untidy3.school
, based on untidy2.school
, we want to create a new variable called Year
and a new variable called Value
that display the year and the result from that year, respectively. The variable Year
should be a numeric vector containing either 2017 or 2018. The most efficient way to modify the data in this manner is to start by renaming value2017
and value2018
to nonsynctactic names 2017
and 2018
. Remember that you need to surround nonsyncactic names with backticks to achieve this result.
untidy3.school = untidy2.school %>%
rename(`2017`=value2017,`2018`=value2018) %>%
# Other option
# gather(`2017`:`2018`,key="Year",value="Value",convert=T)
pivot_longer(`2017`:`2018`,names_to='Year',values_to='Value') %>%
mutate(Year=as.integer(Year))
glimpse(untidy3.school)
## Rows: 1,592
## Columns: 6
## $ CID <chr> "1.1", "1.1", "1.1", "1.1", "1.2", "1.2", "1.2", "1.2", "1.3",…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ student <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 7,…
## $ type <chr> "Score", "Score", "Percentile", "Percentile", "Score", "Score"…
## $ Year <int> 2017, 2018, 2017, 2018, 2017, 2018, 2017, 2018, 2017, 2018, 20…
## $ Value <dbl> 57, 73, 74, 32, 42, 79, 19, 84, 44, NA, 25, NA, 45, 71, 29, 17…
head(untidy3.school,10)
## # A tibble: 10 × 6
## CID school student type Year Value
## <chr> <int> <int> <chr> <int> <dbl>
## 1 1.1 1 1 Score 2017 57
## 2 1.1 1 1 Score 2018 73
## 3 1.1 1 1 Percentile 2017 74
## 4 1.1 1 1 Percentile 2018 32
## 5 1.2 1 2 Score 2017 42
## 6 1.2 1 2 Score 2018 79
## 7 1.2 1 2 Percentile 2017 19
## 8 1.2 1 2 Percentile 2018 84
## 9 1.3 1 3 Score 2017 44
## 10 1.3 1 3 Score 2018 NA
The variable type
in untidy3.school
indicates that two completely different variables are contained in the recently created variable called Value
. Both the scores and percentiles of students are contained in Value
. Using the function spread()
we can create two new variables, Score
and Percentile
, that display the information contained in Value
in separate columns. Using untidy3.school
, create a new tibble called tidy.school
that accomplishes these tasks.
tidy.school = untidy3.school %>%
pivot_wider(names_from=type,values_from=Value)
# Other option
# spread(key=type,value=Value)
glimpse(tidy.school)
## Rows: 796
## Columns: 6
## $ CID <chr> "1.1", "1.1", "1.2", "1.2", "1.3", "1.3", "1.5", "1.5", "1.…
## $ school <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ student <int> 1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11,…
## $ Year <int> 2017, 2018, 2017, 2018, 2017, 2018, 2017, 2018, 2017, 2018,…
## $ Score <dbl> 57, 73, 42, 79, 44, NA, 45, 71, 47, 75, 48, 79, 54, 72, 64,…
## $ Percentile <dbl> 74, 32, 19, 84, 25, NA, 29, 17, 36, 51, 40, 84, 64, 24, 91,…
head(tidy.school,10)
## # A tibble: 10 × 6
## CID school student Year Score Percentile
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 1.1 1 1 2017 57 74
## 2 1.1 1 1 2018 73 32
## 3 1.2 1 2 2017 42 19
## 4 1.2 1 2 2018 79 84
## 5 1.3 1 3 2017 44 25
## 6 1.3 1 3 2018 NA NA
## 7 1.5 1 5 2017 45 29
## 8 1.5 1 5 2018 71 17
## 9 1.6 1 6 2017 47 36
## 10 1.6 1 6 2018 75 51
The original data contains explicitly missing and implicitly missing values. Instances of both can be visibly seen in the first ten observations. Below is a table showing the first 10 observations in the cleaned dataset we called tidy.school
. To appropriately, view this we have to sort our observations by school
and student
as seen in the original dataset untidy.school
.
CID | school | student | Year | Score | Percentile |
---|---|---|---|---|---|
1.1 | 1 | 1 | 2017 | 57 | 74 |
1.1 | 1 | 1 | 2018 | 73 | 32 |
1.2 | 1 | 2 | 2017 | 42 | 19 |
1.2 | 1 | 2 | 2018 | 79 | 84 |
1.3 | 1 | 3 | 2017 | 44 | 25 |
1.3 | 1 | 3 | 2018 | ||
1.5 | 1 | 5 | 2017 | 45 | 29 |
1.5 | 1 | 5 | 2018 | 71 | 17 |
1.6 | 1 | 6 | 2017 | 47 | 36 |
1.6 | 1 | 6 | 2018 | 75 | 51 |
Based on the table above, you can see that student 3 from school 1 has a missing score and percentile for the year 2018. This is an example of explicitly missing information.
Based on the table above, you can see that student 4 from school 1 is clearly missing scores and percentiles from both years 2017 and 2018. This is an example of implicitly missing information.
Use the complete()
function to convert all implicitly missing to explicitly missing. Create a new table called tidy2.school
that reports missing values as NA
for all combinations of school, student, and year.
tidy2.school=tidy.school %>%
complete(school,student,Year)
The first 10 rows of tidy2.school
are displayed below.
tab.tidy2.school = tidy2.school %>%
head(10) %>%
xtable(digits=0,align="ccccccc")
print(tab.tidy2.school,type="html",include.rownames=F,
html.table.attributes="align='center',
rules='rows',
width=50%,
frame='hsides',
border-spacing=5px"
)
school | student | Year | CID | Score | Percentile |
---|---|---|---|---|---|
1 | 1 | 2017 | 1.1 | 57 | 74 |
1 | 1 | 2018 | 1.1 | 73 | 32 |
1 | 2 | 2017 | 1.2 | 42 | 19 |
1 | 2 | 2018 | 1.2 | 79 | 84 |
1 | 3 | 2017 | 1.3 | 44 | 25 |
1 | 3 | 2018 | 1.3 | ||
1 | 4 | 2017 | |||
1 | 4 | 2018 | |||
1 | 5 | 2017 | 1.5 | 45 | 29 |
1 | 5 | 2018 | 1.5 | 71 | 17 |
If you inspect the first 10 rows of tidy2.school
, you should see that the variable CID
is missing for student 4 from school 1 even though we know that this students unique district ID should be “1.4”. Using the pipe %>%
, combine all previous statements in an order where this will not occur. Create a tibble named final.tidy.school
using a chain of commands that begins with calling the original tibble untidy.school
final.tidy.school = untidy.school %>%
rename('2017'=value2017,'2018'=value2018) %>%
pivot_longer('2017':'2018',names_to="Year",values_to="Value") %>%
mutate(Year=as.integer(Year)) %>%
pivot_wider(names_from=type,values_from=Value) %>%
complete(school,student,Year) %>%
unite(CID,school,student,sep=".",remove=F)
final.tidy.school
## # A tibble: 800 × 6
## CID school student Year Score Percentile
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 1.1 1 1 2017 57 74
## 2 1.1 1 1 2018 73 32
## 3 1.2 1 2 2017 42 19
## 4 1.2 1 2 2018 79 84
## 5 1.3 1 3 2017 44 25
## 6 1.3 1 3 2018 NA NA
## 7 1.4 1 4 2017 NA NA
## 8 1.4 1 4 2018 NA NA
## 9 1.5 1 5 2017 45 29
## 10 1.5 1 5 2018 71 17
## # … with 790 more rows
The figure below uses boxplots to show the distribution of scores in the 20 schools for the years 2017 and 2018. How would you interpret it?
ggplot(final.tidy.school) +
geom_boxplot(aes(x=as.factor(Year),y=Score,fill=as.factor(school))) +
guides(fill=F)+
theme_minimal()
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
Using different colors for each student, the next two pictures show the change in test scores and percentiles for all students (without missing values) sampled from the district. Both of these pictures are necessary in understanding the improvement in mathematical knowledge on the student level. As you can see, they are very different from each other. Hypothesize a reason that would have caused this phenomenon to occur.
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Score,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 7 row(s) containing missing values (geom_path).
ggplot(final.tidy.school) +
geom_line(aes(x=Year,y=Percentile,color=as.factor(CID))) +
guides(color=F) +
scale_x_discrete(breaks=c(2017,2018),labels=c(2017,2018)) +
theme_minimal()
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: Removed 7 row(s) containing missing values (geom_path).