The main purpose of this tutorial is to put together all 5 key functions from dplyr and use them to create summary tables and graphs in RMarkdown. The functions and their purposes are listed as follows:

- `filter()`: Selects Observations Based on Values
- `arrange()`: Sorts Observations Based on Criteria
- `select()` or `rename()`: Selects, Deselects, Renames, and Reorders Variables
- `mutate()` or `transmute()`: Creates New Variables Which Were Originally Nonexistent
- `summarize()`: Aggregates the Data Based on Helpful Summary Statistics
We will continue to practice our skills using the dataset `flights` by loading the R package `nycflights13`.
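Before diving into the exercise, here is a minimal sketch that chains all five verbs with the pipe. It assumes dplyr and nycflights13 are already loaded; the variable names come straight from `flights`, but the particular filter and summary shown are just illustrative choices.

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(month == 1) %>%                        # keep January flights only
  arrange(desc(dep_delay)) %>%                  # sort by departure delay, worst first
  select(carrier, origin, dest, dep_delay) %>%  # keep (and reorder) four variables
  mutate(dep_delay_hr = dep_delay / 60) %>%     # new variable: delay in hours
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))  # collapse to one summary row
```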
Using the pipe `%>%` to chain functions from dplyr, modify the code below to create a dataset named `flight.summary` that contains the following ordered modifications:

1. Start with the raw dataset `flights` in the package `nycflights13`.
2. Transform the clock-time variables so that they are measured in minutes since midnight, compute departure and arrival delays from them, and then create a metric to measure accuracy based on those transformed delay variables (the HHMM conversion is sketched just after this list).
3. Remove observations with missing accuracy.
4. Group the data based on combinations of "origin", "dest", and "carrier".
5. Summarize the data using the number of observations, the accuracy mean, the accuracy variance, and the mean distance for each combination of "origin", "dest", and "carrier".
6. Filter the summarized data to remove all combinations of "origin", "dest", and "carrier" with 10 or fewer flights throughout 2013.
7. Create a variable called "proportion" that represents each combination's share of the flights remaining in the summarized data.
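The arithmetic in step 2 is worth seeing in isolation. Times like `dep_time` are stored in HHMM form (e.g., 1340 for 1:40 PM), so integer division and modulo split out the hours and minutes:

```r
dep_time <- 1340                           # 1:40 PM in HHMM form
dep_time %/% 100                           # integer division: hours -> 13
dep_time %%  100                           # modulo: minutes -> 40
(dep_time %/% 100) * 60 + dep_time %% 100  # 13*60 + 40 = 820 minutes since midnight
```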
flight.summary =
  #1: start from the raw data
  flights %>%
  #2: convert HHMM clock times to minutes since midnight, then build delays and accuracy
  mutate(
    dep_min = (dep_time%/%100)*60+dep_time%%100,
    sched_dep_min = (sched_dep_time%/%100)*60+sched_dep_time%%100,
    arr_min = (arr_time%/%100)*60+arr_time%%100,
    sched_arr_min = (sched_arr_time%/%100)*60+sched_arr_time%%100,
    dep_delay_min = dep_min-sched_dep_min,
    arr_delay_min = arr_min-sched_arr_min,
    accuracy = abs(dep_delay_min) + abs(arr_delay_min)
  ) %>%
  #3: drop rows where accuracy could not be computed
  filter(!is.na(accuracy)) %>%
  #4: one group per origin-dest-carrier combination
  group_by(origin, dest, carrier) %>%
  #5: summarize each combination
  summarize(
    count=n(),
    mean.acc=mean(accuracy),
    var.acc=var(accuracy),
    mean.dist=mean(distance,na.rm=TRUE),
    .groups = 'drop'  # ungroup so the later mutate sums counts over all rows
  ) %>%
  #6: keep only combinations with more than 10 flights
  filter(count>10) %>%
  #7: each combination's share of the remaining flights
  mutate(proportion=count/sum(count))
head(flight.summary,10)
## # A tibble: 10 × 8
## origin dest carrier count mean.acc var.acc mean.dist proportion
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 EWR ALB EV 418 77.9 46873. 143 0.00127
## 2 EWR ATL DL 3122 46.9 21254. 746 0.00952
## 3 EWR ATL EV 1658 92.2 66850. 746 0.00506
## 4 EWR ATL UA 102 46.5 8893. 746 0.000311
## 5 EWR AUS UA 664 163. 172017. 1504 0.00203
## 6 EWR AUS WN 295 27.8 950. 1504 0.000900
## 7 EWR AVL EV 253 38.6 4009. 583 0.000772
## 8 EWR BDL EV 405 88.0 98628. 116 0.00124
## 9 EWR BNA EV 1682 64.9 20108. 748 0.00513
## 10 EWR BNA WN 564 29.8 4550. 748 0.00172
The purpose of creating this plot is to investigate whether flight accuracy gets worse or better for longer flights. The chart above displays the approximated linear relationship between flight accuracy and flight distance for each of the 3 NYC airports. This first image was created using a combination of `geom_point()` and `geom_smooth()`. Recreate this image using `ggplot()`, and remember to modify the defaults of `geom_smooth()` to get linear trend lines without standard error regions.
ggplot(data=flight.summary) +
geom_point(aes(x=mean.dist, y=mean.acc, color=origin))+
geom_smooth(aes(x=mean.dist, y=mean.acc, color=origin), method='lm', se=F)
## `geom_smooth()` using formula = 'y ~ x'
The first thing we notice is that not many flights are over 3000 miles in distance. Design a new plot similar to the one above that ignores the scenarios where the distance exceeds 3000 miles. This will prevent rare occurrences from affecting our linear trend comparison and present these relationships in a zoomed-in window where the majority of the data exists.
ggplot(data=filter(flight.summary,mean.dist<=3000)) +
geom_point(aes(x=mean.dist, y=mean.acc, color=origin))+
geom_smooth(aes(x=mean.dist, y=mean.acc, color=origin), method='lm', se=F)
## `geom_smooth()` using formula = 'y ~ x'
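A side note on the design choice: filtering the data before plotting refits the trend lines without the long-haul routes, which is what we want here. If you only wanted to zoom the viewing window while keeping every point in the fit, `coord_cartesian()` would do that instead; this alternative is shown purely for contrast, not as the exercise answer:

```r
library(ggplot2)

# Zooms the axes without removing any observations from the lm fits
ggplot(data = flight.summary) +
  geom_point(aes(x = mean.dist, y = mean.acc, color = origin)) +
  geom_smooth(aes(x = mean.dist, y = mean.acc, color = origin), method = 'lm', se = F) +
  coord_cartesian(xlim = c(0, 3000))
```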
Based on your work, think of answers to the following 2 discussion questions:

1. Which trend line was most affected by the removal of the few cases where the average distance exceeded 3000 miles?
2. Each trend line behaves differently. What do you find from the three trend lines? (Hint: consider the relationship between accuracy and distance for the different airports.)
The dataset we created, `flight.summary`, summarizes the raw data but still contains 370 observations. Below I have provided code that produces a dataset called `flight.summary2` that only contains the top 5 and bottom 5 combinations of "origin", "dest", and "carrier" based on mean accuracy. Closely examine this code and ensure that you understand what is happening here. Set `eval=T` to create `flight.summary2`.
flight.summary2 =
flight.summary %>%
mutate(rank=min_rank(mean.acc)) %>%
filter(min_rank(mean.acc)<=5 | min_rank(desc(mean.acc))<=5) %>%
arrange(rank)
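If `min_rank()` is new to you, a quick toy example may help; the vector here is made up purely for illustration. Ties share the smallest rank (so the next rank is skipped), and `desc()` simply ranks from the other end:

```r
library(dplyr)

x <- c(30, 10, 20, 20)
min_rank(x)        # 4 1 2 2 -- the tied 20s share rank 2, so rank 3 is skipped
min_rank(desc(x))  # 1 4 2 2 -- desc() ranks from largest to smallest
```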
The `kable()` function in the knitr package allows for creating HTML tables. Furthermore, the kableExtra package allows you to add beauty to those internet-ready tables. Click here for a helpful introduction. Once you understand the function of `kable()` and the modifications you can apply, start an R code chunk that produces an HTML table. Click here for additional guidance on the options required to turn a tibble into HTML code using `kable()` and then turn that HTML code into an integrated webpage table.
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

flight.summary2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| origin | dest | carrier | count | mean.acc | var.acc | mean.dist | proportion | rank |
|---|---|---|---|---|---|---|---|---|
| LGA | ATL | WN | 58 | 16.62069 | 460.06413 | 762 | 0.0001769 | 1 |
| JFK | ATL | 9E | 52 | 18.98077 | 383.78394 | 760 | 0.0001586 | 2 |
| LGA | BWI | WN | 15 | 19.06667 | 53.20952 | 185 | 0.0000458 | 3 |
| LGA | MKE | FL | 55 | 20.43636 | 235.06532 | 738 | 0.0001678 | 4 |
| JFK | PSP | VX | 18 | 22.22222 | 146.41830 | 2378 | 0.0000549 | 5 |
| LGA | MHT | EV | 98 | 313.02041 | 493392.76247 | 195 | 0.0002989 | 366 |
| JFK | AUS | B6 | 744 | 323.30108 | 342763.41663 | 1521 | 0.0022693 | 367 |
| JFK | PSE | B6 | 361 | 565.92244 | 481630.38841 | 1617 | 0.0011011 | 368 |
| JFK | DEN | B6 | 337 | 679.21958 | 553766.22545 | 1626 | 0.0010279 | 369 |
| JFK | PDX | B6 | 324 | 707.92901 | 528144.31383 | 2454 | 0.0009882 | 370 |
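The long decimals above are accurate but noisy. `kable()` accepts a `digits` argument for rounding and a `caption` argument for a table title, and `kable_styling()` can stop the table from stretching across the page; this variant is just a suggestion, not part of the required exercise:

```r
flight.summary2 %>%
  kable(digits = 2,
        caption = "Top 5 and bottom 5 origin-dest-carrier combinations by mean accuracy") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```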
In this class you will be forced to analyze data you probably care very little about. This is the part in the class where you begin to act like you care. An important rule in this class is to "love thy data as thyself." Consider the `flights` data from both the view of the consumer and the airline executive. Think of the different questions that may arise on both sides. A consumer, such as myself, might view the data as an opportunity to explain to American Airlines why they are embarrassing all Americans. The airline executive might want to know how their carrier compares to other carriers on departure and arrival delays. Once you have the question you want to answer, the next step is using the data to answer your question.
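As one illustration of turning a question into code, the executive's comparison maps directly onto the verbs from this tutorial. This is a minimal sketch of a starting point, not a full analysis:

```r
# How does each carrier compare on average departure and arrival delay?
flights %>%
  group_by(carrier) %>%
  summarize(
    flights  = n(),
    mean.dep = mean(dep_delay, na.rm = TRUE),
    mean.arr = mean(arr_delay, na.rm = TRUE),
    .groups  = 'drop'
  ) %>%
  arrange(mean.arr)
```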