The main purpose of this tutorial is to practice 4 key functions from dplyr. The functions and their purposes are listed as follows:
filter()
Selects Observations Based on
Values
arrange()
Sorts Observations Based on
Criteria
select()
or rename()
Selects,
Deselects, Renames, and Reorders Variables
mutate()
or transmute()
Creates New
Variables Which Were Originally Nonexistant
We will practice our skills using the dataset flights
by
loading the R package nycflights13
.
Using filter()
, start by creating a new tibble called
f1a
that only contains records with flight number 807.
f1a <- filter(flights, flight==807)
head(f1a,5)
## # A tibble: 5 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 2 744 730 14 1017 1007
## 2 2013 1 3 729 730 -1 1000 1004
## 3 2013 1 4 728 730 -2 946 1004
## 4 2013 1 5 727 730 -3 941 1004
## 5 2013 1 7 901 730 91 1118 1004
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Now, apply select()
to create a new dataset
f1b
based on f1a
only containing variables
“flight”, “carrier”, “origin”, and “dest”.
f1b <- select(f1a, flight, carrier, origin, dest)
head(f1b,5)
## # A tibble: 5 × 4
## flight carrier origin dest
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
Rename the variable “dest” to “destination” in f1c
.
f1c <- rename(f1b, destination=dest)
head(f1c,5)
## # A tibble: 5 × 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
Because f1c
only contains records for flight number 807,
sorting by flight number is irrelevant. Create a new dataset
f1d
that is identical in content to f1b
but is
sorted by “carrier” first, “origin” second, and “destination”” last. Use
the function head()
to display the first 5 rows of the data
in f1d
.
f1d <- arrange(f1c, carrier, origin, destination)
head(f1d, 5)
## # A tibble: 5 × 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 DL EWR ATL
## 2 807 DL EWR ATL
## 3 807 DL EWR ATL
## 4 807 DL EWR ATL
## 5 807 DL EWR ATL
At first glance, it seems that 807 uniquely represents the flight
from _______ to _______ by carrier ________. To confirm this, create a
new dataset f1e
that is f1d
sorted by the
three variables previously mentioned, all in descending order. The
function desc()
is required here. Follow this by displaying
the first 18 rows of the data in f1d
.
f1e <- arrange(f1d,desc(carrier),desc(origin),desc(destination))
head(f1e, 18)
## # A tibble: 18 × 4
## flight carrier origin destination
## <int> <chr> <chr> <chr>
## 1 807 WN EWR MDW
## 2 807 WN EWR MDW
## 3 807 WN EWR MDW
## 4 807 WN EWR MDW
## 5 807 UA LGA IAH
## 6 807 UA LGA IAH
## 7 807 UA LGA IAH
## 8 807 UA LGA IAH
## 9 807 UA EWR MCO
## 10 807 UA EWR MCO
## 11 807 UA EWR LAS
## 12 807 UA EWR IAH
## 13 807 UA EWR IAH
## 14 807 UA EWR IAH
## 15 807 UA EWR BOS
## 16 807 DL JFK PIT
## 17 807 DL EWR ATL
## 18 807 DL EWR ATL
May we conclude that flight numbers can be mapped to a unique carrier? No__ (Yes/No)
Why or why not?
If we know the flight number, carrier, and origin, can we know for sure what the destination will be? ____No____ (Yes/No)
Why or why not?
If a counter example is not found from f1e
, you can
answer the question via the following data transformation steps.
Let’s first get all the unique combinations of flight, carrier,
origin and destination and save them to a data frame
f1
.
f1 <- flights %>%
select(flight, carrier, origin, dest) %>%
group_by(flight, carrier, origin,dest) %>%
summarize(
n=n(),
) %>%
arrange(flight, carrier, origin)
## `summarise()` has grouped output by 'flight', 'carrier', 'origin'. You can
## override using the `.groups` argument.
Then we group the observations in f1
by flight, carrier
and origin and count observations in each group. If a certain
combination of flight number, carrier, and origin can be uniquely mapped
to a destination, each group should have only one observation
(f2$n==1
).
f2 <- f1 %>%
select(flight, carrier, origin, dest) %>%
group_by(flight, carrier, origin) %>%
summarize(
n=n(),
)
## `summarise()` has grouped output by 'flight', 'carrier'. You can override using
## the `.groups` argument.
My_answer = (sum(f2$n!=1)==0)
# If f2$n is not 1 for any of the observation, the claim is False. Therefore, as long as the sum is not 0, the answer should be False.
print(paste("If we know the flight number, carrier, and origin, we can know for sure what the destination will be. True or False? My answer is",My_answer))
## [1] "If we know the flight number, carrier, and origin, we can know for sure what the destination will be. True or False? My answer is FALSE"
The pipe %>%
is used in a similar fashion to
+
in the ggplot2()
package.
Pipe operator (%>%
) can be read as ‘then’.
One simple example: Compute the logarithm of \(x\), return suitably lagged and iterated differences, compute the exponential function and round the result to 1 decimal place.
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
Method 1:
round(exp(diff(log(x))), 1)
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
Method 2:
x %>% log() %>%
diff() %>%
exp() %>%
round(1)
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
It’s time to take a magical journey through the tidyverse on the
pipe. Start by using transmute()
to create a new variable
“dep_hr” based on “dep_time” which represents departure time in hours
since midnight. Recall the code for this is
dep_hr=dep_time%/%100+(dep_time%%100)/60
. In the same step,
do this also for “sched_dep_time”,“arr_time”, and “sched_arr_time”
naming the new variables “sched_dep_hr”, “arr_hr”, and “sched_arr_hr”,
respectively. Save all these new variables to a new tibble called
f2a
. Use the function names()
to ensure that
f2a
only contains the new variables and the function
head()
to view the top 5 rows.
f2a <- transmute(flights,
dep_hr=dep_time%/%100+(dep_time%%100)/60,
sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
arr_hr=arr_time%/%100+(arr_time%%100)/60,
sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60)
names(f2a)
## [1] "dep_hr" "sched_dep_hr" "arr_hr" "sched_arr_hr"
head(f2a,5)
## # A tibble: 5 × 4
## dep_hr sched_dep_hr arr_hr sched_arr_hr
## <dbl> <dbl> <dbl> <dbl>
## 1 5.28 5.25 8.5 8.32
## 2 5.55 5.48 8.83 8.5
## 3 5.7 5.67 9.38 8.83
## 4 5.73 5.75 10.1 10.4
## 5 5.9 6 8.2 8.62
Now we can create true delay variables, measured in hours, for both
departure and arrival. Using mutate()
, create a new
variable “dep_delay_hr” which equals the difference between “dep_hr” and
“sched_dep_hr”. Analogously, perform the same operation for arrival.
Call the new dataset f2b
.
f2b <- mutate(f2a,
dep_delay_hr=dep_hr-sched_dep_hr,
arr_delay_hr=arr_hr-sched_arr_hr)
head(f2b,5)
## # A tibble: 5 × 6
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.28 5.25 8.5 8.32 0.0333 0.183
## 2 5.55 5.48 8.83 8.5 0.0667 0.333
## 3 5.7 5.67 9.38 8.83 0.0333 0.550
## 4 5.73 5.75 10.1 10.4 -0.0167 -0.300
## 5 5.9 6 8.2 8.62 -0.100 -0.417
Use the function percent_rank()
to create a new variable
“percent_dep_delay_hr” which represents the percentage of the variable
you previously created. Notice that you can develop variables based on
recently transformed variables in the same iteration of
mutate()
, but be careful little buddy because order
matters. Name the new dataset f2c
.
f2c <- mutate(f2b,
percent_dep_delay_hr=percent_rank(dep_delay_hr))
Use filter()
to select the observations where
percent_dep_delay_hr<0.1
or
percent_dep_delay_hr>0.9
in a new dataset
f2d
. The tibble f2d
will contain the bottom
10% and top 10% of flights based off “dep_delay_hr”.
f2d <- filter(f2c,percent_dep_delay_hr<0.1|percent_dep_delay_hr>0.9)
head(f2d,5)
## # A tibble: 5 × 7
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.03 6.17 8.2 8.33 -0.133 -0.133
## 2 6.37 6.5 10.3 10.2 -0.133 0.0500
## 3 6.62 6.75 9.5 9.58 -0.133 -0.0833
## 4 6.93 7.08 10.1 9.67 -0.150 0.450
## 5 8 8.17 9.82 9.92 -0.167 -0.100
## # ℹ 1 more variable: percent_dep_delay_hr <dbl>
Finally, sort the data using arrange()
from largest to
smallest based on the variable “percent_dep_delay_hr”. Name the sorted
tibble f2e. Use head()
on f2e
to show the top
5 flights based on the constructed variable “dep_delay_hr”.
f2e <- arrange(f2d,desc(percent_dep_delay_hr))
head(f2e,5)
## # A tibble: 5 × 7
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 23.4 8.17 1.58 10.3 15.2 -8.75
## 2 23.0 7.98 1.35 10.4 15.0 -9.08
## 3 22.7 8.5 1 11.1 14.2 -10.1
## 4 23.4 10.3 1.23 12.4 13.1 -11.2
## 5 19.4 6.25 21.6 8.7 13.1 12.9
## # ℹ 1 more variable: percent_dep_delay_hr <dbl>
Getting the original data from flights
to
f2e
required multiple steps. If we know what we want to do
with the raw data from flights
, we can use the pipe
%>%
to obtain the same result without intermittently
introducing new tibbles into our global environment. The R code below
turns your pipe dreams into reality.
f2e.pipedream = flights %>%
transmute(dep_hr=dep_time%/%100+(dep_time%%100)/60,
sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
arr_hr=arr_time%/%100+(arr_time%%100)/60,
sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60) %>%
mutate(dep_delay_hr=dep_hr-sched_dep_hr,
arr_delay_hr=arr_hr-sched_arr_hr) %>%
mutate(percent_dep_delay_hr=percent_rank(dep_delay_hr)) %>%
filter(percent_dep_delay_hr<0.1|percent_dep_delay_hr>0.9) %>%
arrange(desc(percent_dep_delay_hr))
head(f2e.pipedream,5)
## # A tibble: 5 × 7
## dep_hr sched_dep_hr arr_hr sched_arr_hr dep_delay_hr arr_delay_hr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 23.4 8.17 1.58 10.3 15.2 -8.75
## 2 23.0 7.98 1.35 10.4 15.0 -9.08
## 3 22.7 8.5 1 11.1 14.2 -10.1
## 4 23.4 10.3 1.23 12.4 13.1 -11.2
## 5 19.4 6.25 21.6 8.7 13.1 12.9
## # ℹ 1 more variable: percent_dep_delay_hr <dbl>
The tibble f2e.pipedream
is identical to
f2e
in the number of observations (65,737), the number of
variables (7), and the order of observations. We can check to see if the
tibble f2e
is identical to f2e.pipedream
using
identical()
. Can you feel the sensation? It’s piping hot up
in here.
identical(f2e,f2e.pipedream)
## [1] TRUE
How would you measure the accuracy of individual flights? We can say
that a flight is accurate if it leaves on time and arrives on time.
Suppose we want to create an accuracy measure that captures this
information where larger values indicate more inaccurate flights. Try to
think of a creative way to measure accuracy by using
mutate()
to construct a new variable named “accuracy”. Call
the new tibble f.accuracy
. As a data scientist, the metric
you want is not always in the raw data. This is an example of problem
that requires a level of innovation for which a job will grant you some
cash money.
f.accuracy<-mutate(flights,
dep_hr=dep_time%/%100+(dep_time%%100)/60,
sched_dep_hr=sched_dep_time%/%100+(sched_dep_time%%100)/60,
arr_hr=arr_time%/%100+(arr_time%%100)/60,
sched_arr_hr=sched_arr_time%/%100+(sched_arr_time%%100)/60,
dep_delay_hr=dep_hr-sched_dep_hr,
arr_delay_hr=arr_hr-sched_arr_hr,
accuracy=abs(dep_delay_hr)+abs(arr_delay_hr))
head(f.accuracy,5)
## # A tibble: 5 × 26
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## # ℹ 18 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hr <dbl>,
## # sched_dep_hr <dbl>, arr_hr <dbl>, sched_arr_hr <dbl>, dep_delay_hr <dbl>,
## # arr_delay_hr <dbl>, accuracy <dbl>
Preparing for comparisons of airline carriers on accuracy, we use the
select()
function to create a new tibble named
f.accuracy2
which only contains the variables “carrier” and
“accuracy”.
f.accuracy2=select(f.accuracy,carrier,accuracy)
Next, we can evaluate carriers based on their average accuracy across
all flights based on our new metric. Furthermore, the standard deviation
of this accuracy metric can help us measure the consistency of these
airline carriers in performance. The summarize()
function
combined with group_by()
allows for quick aggregation on
the carrier level.
carrier.summary<- f.accuracy2 %>%
group_by(carrier) %>%
summarize(
mean.accuracy=mean(accuracy,na.rm=T),
sd.accuracy=sd(accuracy,na.rm=T)
) %>%
arrange(mean.accuracy)
carrier.summary
## # A tibble: 16 × 3
## carrier mean.accuracy sd.accuracy
## <chr> <dbl> <dbl>
## 1 HA 0.556 0.669
## 2 US 0.572 1.52
## 3 AS 0.837 2.11
## 4 AA 0.862 2.52
## 5 OO 0.888 1.28
## 6 MQ 0.921 2.69
## 7 WN 0.969 2.95
## 8 DL 0.979 3.25
## 9 F9 1.03 2.43
## 10 YV 1.04 1.98
## 11 9E 1.17 3.15
## 12 UA 1.17 3.70
## 13 FL 1.24 3.97
## 14 VX 1.34 4.28
## 15 EV 1.48 4.21
## 16 B6 2.15 6.05
Following this tutorial, we will explore the depth of
summarize()
, the most important function in
dplyr. For right now, just know that not all pipes are
created free and equal; however, this pipe %>%
is
straight-up magical. In the early stages, use the aforementioned pipe
with caution, but once you develop a tolerance, you are prescribed to
use it liberally.