To finish off Chapter 12, let’s pull together everything you’ve
learned to tackle a realistic data tidying problem. The
tidyr::who
dataset contains tuberculosis (TB) cases broken
down by year, country, age, gender, and diagnosis method. The data comes
from the 2014 World Health Organization Global Tuberculosis Report,
available at http://www.who.int/tb/country/data/download/en/.
There’s a wealth of epidemiological information in this dataset (7240 rows, 60 columns), but it’s challenging to work with the data in the form that it’s provided:
head(who,10)
## # A tibble: 10 × 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>,
## # new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, …
This is a very typical real-life example dataset. It contains
redundant columns, odd variable codes, and many missing values. In
short, who
is messy, and we’ll need multiple steps to tidy
it. Like dplyr, tidyr is designed so that each function does one thing
well. That means in real-life situations you’ll usually need to string
together multiple verbs into a pipeline.
When you get the desired result for each step, change
Eval=F
to Eval=T
and knit the document to HTML
to make sure it works. After you complete the lab, you should submit
your HTML file of what you have completed on Canvas before the
deadline.
Some observations on the data:
It looks like country
, iso2
, and
iso3
are three variables that redundantly specify the
country.
year
is clearly also a varialbe.
We don’t know what all the other columns are yet, but given the
structure in the variable names (e.g. new_sp_m014
,
new_ep_m014
, new_ep_f014
) these are likely to
be values, not variables.
We don’t know what those values represent yet, so we’ll give them the
generic name “key”. We know the cells represent the count of cases, so
we’ll use the variable cases
. There are a lot of missing
values in the current representation, so for now we’ll use
values_drop_na
just so we can focus on the values that are
present.
who1 <- who %>%
pivot_longer(
cols = VAR_START:VAR_END,
names_to = FILL,
values_to = FILL,
values_drop_na = TRUE_OR_FALSE
)
head(who1,10)
For the key
column, the data dictionary tells us:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
rel
stands for cases of relapseep
stands for cases of extrapulmonary TBsn
stands for cases of pulmonary TB that could not be
diagnosed by a pulmonary smear (smear negative)sp
stands for cases of pulmonary TB that could be
diagnosed be a pulmonary smear (smear positive)The sixth letter gives the sex of TB patients. The dataset groups
cases by males (m
) and females (f
).
The remaining numbers gives the age group. The dataset groups cases into seven age groups:
014
= 0 – 14 years old1524
= 15 – 24 years old2534
= 25 – 34 years old3544
= 35 – 44 years old4554
= 45 – 54 years old5564
= 55 – 64 years old65
= 65 or olderThe names are slightly inconsistent for key
because
instead of new_rel
we have newrel
. Run the
following code to make it consistent:
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
key
column into columns
new
, type
and sexage
. Then drop
the new
column because it’s constant in this dataset.
Please also drop iso2
, iso3
as they are also
redundant.who3 <- who2 %>%
separate(VAR, c(VAR1,VAR2,VAR3), sep = FILL) %>%
select(DROP_VARIABLES)
sexage
column into columns
sex
and age
. (Hint: if sep=Number, interpreted
as positions to split at)who4 <- who3 %>%
separate(VAR, c(VAR1, VAR2), sep = FILL)
#
First, let’s import a Comma Separated Values .csv
file
that exists on the internet. The .csv
file
dem_score.csv
contains ratings of the level of democracy in
different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the
read_csv()
function from the readr
package to
read it off the web, import it into R, and save it in a data frame
called dem_score
. In the following part, we’re going to
focus on only data corresponding to Guatemala.
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
## Rows: 96 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (9): 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
guat_dem <- dem_score %>%
filter(FILL)
guat_dem
year
and put values to a new variable
democracy_score
. Make sure the year
column is
of integer type.guat_dem_tidy <- guat_dem %>%
pivot_longer(cols = SELECT_COLUMNS,
names_to = FILL,
values_to = FILL) %>%
mutate(FILL)
guat_dem_tidy
guat_dem_tidy
data to
reflect the democracy trend in Guatemala.#