Introduction

To finish off Chapter 12, let’s pull together everything you’ve learned to tackle a realistic data tidying problem. The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.

There’s a wealth of epidemiological information in this dataset (7240 rows, 60 columns), but it’s challenging to work with the data in the form that it’s provided:

head(who,10)
## # A tibble: 10 × 60
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
## #   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
## #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
## #   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>,
## #   new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, …

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.

When you get the desired result for each step, change Eval=F to Eval=T and knit the document to HTML to make sure it works. After you complete the lab, you should submit your HTML file of what you have completed on Canvas before the deadline.

Part 1: Gather Variables Together

Some observations on the data:

Q1: Gather together all the columns from new_sp_m014 to newrel_f65.

We don’t know what those values represent yet, so we’ll give them the generic name “key”. We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use values_drop_na just so we can focus on the values that are present.

who1 <- who %>% 
  pivot_longer(
    cols = VAR_START:VAR_END, 
    names_to = FILL, 
    values_to = FILL, 
    values_drop_na = TRUE_OR_FALSE
  )
head(who1,10)

Q2: Separate key column

For the key column, the data dictionary tells us:

  • The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

  • The next two letters describe the type of TB:

    • rel stands for cases of relapse
    • ep stands for cases of extrapulmonary TB
    • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
    • sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)
  • The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  • The remaining numbers gives the age group. The dataset groups cases into seven age groups:

    • 014 = 0 – 14 years old
    • 1524 = 15 – 24 years old
    • 2534 = 25 – 34 years old
    • 3544 = 35 – 44 years old
    • 4554 = 45 – 54 years old
    • 5564 = 55 – 64 years old
    • 65 = 65 or older
  • The names are slightly inconsistent for key because instead of new_rel we have newrel. Run the following code to make it consistent:

who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2

Q2: Separate the key column into columns new, type and sexage. Then drop the new column because it’s constant in this dataset. Please also drop iso2, iso3 as they are also redundant.

who3 <- who2 %>% 
  separate(VAR, c(VAR1,VAR2,VAR3), sep = FILL) %>%
  select(DROP_VARIABLES)

Q3: Separate the sexage column into columns sex and age. (Hint: if sep=Number, interpreted as positions to split at)

who4 <- who3 %>% 
  separate(VAR, c(VAR1, VAR2), sep = FILL)

Q4: Put all steps in one code chunk with pipe operator

#

More Exercises

First, let’s import a Comma Separated Values .csv file that exists on the internet. The .csv file dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the read_csv() function from the readr package to read it off the web, import it into R, and save it in a data frame called dem_score. In the following part, we’re going to focus on only data corresponding to Guatemala.

dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
## Rows: 96 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (9): 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q5: In the following part, we’re going to focuse on only data corresponding to Guatemala.

guat_dem <- dem_score %>% 
  filter(FILL)
guat_dem

Q6: Gather the columns and put column names to a new variable year and put values to a new variable democracy_score. Make sure the year column is of integer type.

guat_dem_tidy <- guat_dem %>% 
  pivot_longer(cols = SELECT_COLUMNS,
               names_to = FILL, 
               values_to = FILL) %>%
  mutate(FILL)
guat_dem_tidy

Q7: Generate a plot based on the guat_dem_tidy data to reflect the democracy trend in Guatemala.

#