The main purpose of this tutorial is to put together all 5 key functions from dplyr and use them to create summary tables and graphs in RMarkdown. The functions and their purposes are listed as follows:

- `filter()`: Selects Observations Based on Values
- `arrange()`: Sorts Observations Based on Criteria
- `select()` or `rename()`: Selects, Deselects, Renames, and Reorders Variables
- `mutate()` or `transmute()`: Creates New Variables Which Were Originally Nonexistent
- `summarize()`: Aggregates the Data Based on Helpful Summary Statistics
We will continue to practice our skills using the dataset `flights` by loading the R package `nycflights13`.
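Before diving into the exercise, here is a minimal sketch that chains all five verbs with the pipe. It assumes dplyr and nycflights13 are already loaded; the variable names come straight from `flights`, but the particular filter and summary shown are just illustrative choices.

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(month == 1) %>%                        # keep January flights only
  arrange(desc(dep_delay)) %>%                  # sort by departure delay, worst first
  select(carrier, origin, dest, dep_delay) %>%  # keep (and reorder) four variables
  mutate(dep_delay_hr = dep_delay / 60) %>%     # new variable: delay in hours
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))  # collapse to one summary row
```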
Using the pipe `%>%` to chain functions from dplyr, modify the code below to create a dataset named `flight.summary` that contains the following ordered modifications:

1. Start with the raw dataset `flights` in the package `nycflights13`.
2. Transform the clock-time variables so that they are measured in minutes since midnight, compute departure and arrival delays from them, and then create a metric to measure accuracy based on those transformed delay variables (the HHMM conversion is sketched just after this list).
3. Remove observations with missing accuracy.
4. Group the data based on combinations of "origin", "dest", and "carrier".
5. Summarize the data using the number of observations, the accuracy mean, the accuracy variance, and the mean distance for each combination of "origin", "dest", and "carrier".
6. Filter the summarized data to remove all combinations of "origin", "dest", and "carrier" with 10 or fewer flights throughout 2013.
7. Create a variable called "proportion" that represents each combination's share of the flights remaining in the summarized data.
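The arithmetic in step 2 is worth seeing in isolation. Times like `dep_time` are stored in HHMM form (e.g., 1340 for 1:40 PM), so integer division and modulo split out the hours and minutes:

```r
dep_time <- 1340                           # 1:40 PM in HHMM form
dep_time %/% 100                           # integer division: hours -> 13
dep_time %%  100                           # modulo: minutes -> 40
(dep_time %/% 100) * 60 + dep_time %% 100  # 13*60 + 40 = 820 minutes since midnight
```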
flight.summary =
  #1: start from the raw data
  flights %>%
  #2: convert HHMM clock times to minutes since midnight, then build delays and accuracy
  mutate(
    dep_min = (dep_time%/%100)*60+dep_time%%100,
    sched_dep_min = (sched_dep_time%/%100)*60+sched_dep_time%%100,
    arr_min = (arr_time%/%100)*60+arr_time%%100,
    sched_arr_min = (sched_arr_time%/%100)*60+sched_arr_time%%100,
    dep_delay_min = dep_min-sched_dep_min,
    arr_delay_min = arr_min-sched_arr_min,
    accuracy = abs(dep_delay_min) + abs(arr_delay_min)
  ) %>%
  #3: drop rows where accuracy could not be computed
  filter(!is.na(accuracy)) %>%
  #4: one group per origin-dest-carrier combination
  group_by(origin, dest, carrier) %>%
  #5: summarize each combination
  summarize(
    count=n(),
    mean.acc=mean(accuracy),
    var.acc=var(accuracy),
    mean.dist=mean(distance,na.rm=TRUE),
    .groups = 'drop'  # ungroup so the later mutate sums counts over all rows
  ) %>%
  #6: keep only combinations with more than 10 flights
  filter(count>10) %>%
  #7: each combination's share of the remaining flights
  mutate(proportion=count/sum(count))
head(flight.summary,10)
## # A tibble: 10 × 8
## origin dest carrier count mean.acc var.acc mean.dist proportion
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 EWR ALB EV 418 77.9 46873. 143 0.00127
## 2 EWR ATL DL 3122 46.9 21254. 746 0.00952
## 3 EWR ATL EV 1658 92.2 66850. 746 0.00506
## 4 EWR ATL UA 102 46.5 8893. 746 0.000311
## 5 EWR AUS UA 664 163. 172017. 1504 0.00203
## 6 EWR AUS WN 295 27.8 950. 1504 0.000900
## 7 EWR AVL EV 253 38.6 4009. 583 0.000772
## 8 EWR BDL EV 405 88.0 98628. 116 0.00124
## 9 EWR BNA EV 1682 64.9 20108. 748 0.00513
## 10 EWR BNA WN 564 29.8 4550. 748 0.00172
The purpose of creating this plot is to investigate whether flight accuracy gets worse or better for longer flights. The chart above displays the approximated linear relationship between flight accuracy and flight distance for each of the 3 NYC airports. This first image was created using a combination of `geom_point()` and `geom_smooth()`. Recreate this image using `ggplot()`, and remember to modify the defaults of `geom_smooth()` to get linear trend lines without standard error regions.
ggplot(data=flight.summary) +
geom_point(aes(x=mean.dist, y=mean.acc, color=origin))+
geom_smooth(aes(x=mean.dist, y=mean.acc, color=origin), method='lm', se=F)
## `geom_smooth()` using formula = 'y ~ x'
The first thing we notice is that not many flights are over 3000 miles in distance. Design a new plot similar to the one above that ignores the scenarios where the distance exceeds 3000 miles. This will prevent rare occurrences from affecting our linear trend comparison and present these relationships in a zoomed-in window where the majority of the data exists.
ggplot(data=filter(flight.summary,mean.dist<=3000)) +
geom_point(aes(x=mean.dist, y=mean.acc, color=origin))+
geom_smooth(aes(x=mean.dist, y=mean.acc, color=origin), method='lm', se=F)
## `geom_smooth()` using formula = 'y ~ x'
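A side note on the design choice: filtering the data before plotting refits the trend lines without the long-haul routes, which is what we want here. If you only wanted to zoom the viewing window while keeping every point in the fit, `coord_cartesian()` would do that instead; this alternative is shown purely for contrast, not as the exercise answer:

```r
library(ggplot2)

# Zooms the axes without removing any observations from the lm fits
ggplot(data = flight.summary) +
  geom_point(aes(x = mean.dist, y = mean.acc, color = origin)) +
  geom_smooth(aes(x = mean.dist, y = mean.acc, color = origin), method = 'lm', se = F) +
  coord_cartesian(xlim = c(0, 3000))
```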
Based on your work, think of answers to the following 2 discussion questions:

1. Which trend line was most affected by the removal of the few cases where the average distance exceeded 3000 miles?
2. Each trend line behaves differently. What do you find from the three trend lines? (Hint: consider the relationship between accuracy and distance for the different airports.)
The dataset we created, `flight.summary`, summarizes the raw data but still contains 370 observations. Below I have provided code that produces a dataset called `flight.summary2` that only contains the top 5 and bottom 5 combinations of "origin", "dest", and "carrier" based on mean accuracy. Closely examine this code and ensure that you understand what is happening here. Set `eval=T` to create `flight.summary2`.
flight.summary2 =
flight.summary %>%
mutate(rank=min_rank(mean.acc)) %>%
filter(min_rank(mean.acc)<=5 | min_rank(desc(mean.acc))<=5) %>%
arrange(rank)
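If `min_rank()` is new to you, a quick toy example may help; the vector here is made up purely for illustration. Ties share the smallest rank (so the next rank is skipped), and `desc()` simply ranks from the other end:

```r
library(dplyr)

x <- c(30, 10, 20, 20)
min_rank(x)        # 4 1 2 2 -- the tied 20s share rank 2, so rank 3 is skipped
min_rank(desc(x))  # 1 4 2 2 -- desc() ranks from largest to smallest
```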
The `kable()` function in the knitr package allows for creating HTML tables. Furthermore, the kableExtra package allows you to add beauty to those internet-ready tables. Click here for a helpful introduction. Once you understand the function of `kable()` and the modifications you can apply, start an R code chunk that produces an HTML table. Click here for additional guidance on the options required to turn a tibble into HTML code using `kable()` and then turn that HTML code into an integrated webpage table.
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

flight.summary2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| origin | dest | carrier | count | mean.acc | var.acc | mean.dist | proportion | rank |
|---|---|---|---|---|---|---|---|---|
| LGA | ATL | WN | 58 | 16.62069 | 460.06413 | 762 | 0.0001769 | 1 |
| JFK | ATL | 9E | 52 | 18.98077 | 383.78394 | 760 | 0.0001586 | 2 |
| LGA | BWI | WN | 15 | 19.06667 | 53.20952 | 185 | 0.0000458 | 3 |
| LGA | MKE | FL | 55 | 20.43636 | 235.06532 | 738 | 0.0001678 | 4 |
| JFK | PSP | VX | 18 | 22.22222 | 146.41830 | 2378 | 0.0000549 | 5 |
| LGA | MHT | EV | 98 | 313.02041 | 493392.76247 | 195 | 0.0002989 | 366 |
| JFK | AUS | B6 | 744 | 323.30108 | 342763.41663 | 1521 | 0.0022693 | 367 |
| JFK | PSE | B6 | 361 | 565.92244 | 481630.38841 | 1617 | 0.0011011 | 368 |
| JFK | DEN | B6 | 337 | 679.21958 | 553766.22545 | 1626 | 0.0010279 | 369 |
| JFK | PDX | B6 | 324 | 707.92901 | 528144.31383 | 2454 | 0.0009882 | 370 |
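The long decimals above are accurate but noisy. `kable()` accepts a `digits` argument for rounding and a `caption` argument for a table title, and `kable_styling()` can stop the table from stretching across the page; this variant is just a suggestion, not part of the required exercise:

```r
flight.summary2 %>%
  kable(digits = 2,
        caption = "Top 5 and bottom 5 origin-dest-carrier combinations by mean accuracy") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```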
In this class you will be forced to analyze data you probably care very little about. This is the part in the class where you begin to act like you care. An important rule in this class is to "love thy data as thyself." Consider the `flights` data from both the view of the consumer and the airline executive. Think of the different questions that may arise on both sides. A consumer, such as myself, might view the data as an opportunity to explain to American Airlines why they are embarrassing all Americans. The airline executive might want to know how their carrier compares to other carriers on departure and arrival delays. Once you have the question you want to answer, the next step is using the data to answer your question.
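As one illustration of turning a question into code, the executive's comparison maps directly onto the verbs from this tutorial. This is a minimal sketch of a starting point, not a full analysis:

```r
# How does each carrier compare on average departure and arrival delay?
flights %>%
  group_by(carrier) %>%
  summarize(
    flights  = n(),
    mean.dep = mean(dep_delay, na.rm = TRUE),
    mean.arr = mean(arr_delay, na.rm = TRUE),
    .groups  = 'drop'
  ) %>%
  arrange(mean.arr)
```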