Lecture 8 geom_area and geom_ribbon

8.1 Data

The US Bureau of Labor Statistics (BLS) conducts the American Time Use Survey (ATUS). You can download the text form of the ATUS by going to the BLS data page, finding the section labelled Spending & Time Use, then clicking on the “Text Files” button on the row for the ATUS. Or by using the following link:

https://download.bls.gov/pub/time.series/tu/

8.1.1 Downloading a file from the internet

While you can manually download the files from the above URL, download.file() lets you download files from within R. The first argument is the URL of the resource you want to download. The second argument is the destination for the file. The following requests will require you to create the tu folder.

download.file("https://download.bls.gov/pub/time.series/tu/tu.txt", "data/tu/tu.txt")
download.file("https://download.bls.gov/pub/time.series/tu/tu.series", "data/tu/tu.series")
download.file("https://download.bls.gov/pub/time.series/tu/tu.data.0.Current", "data/tu/tu.data.0.Current")

The file tu.txt contains the documentation for the time use (tu) survey data. Section 2 of that file provides descriptions of each of the files in the pub/time.series/tu folder. From that list we can see that tu.series will give us a list of the available series.

library(readr)
series_defn <- read_tsv("data/tu/tu.series")
series_defn

## # A tibble: 85,277 x 43
##             series_id seasonal stattype_code datays_code sex_code
##                 <chr>    <chr>         <int>       <chr>    <int>
##  1 TUU10100AA01000007        U         10100          01        0
##  2 TUU10100AA01000013        U         10100          01        0
##  3 TUU10100AA01000014        U         10100          01        0
##  4 TUU10100AA01000015        U         10100          01        0
##  5 TUU10100AA01000018        U         10100          01        0
##  6 TUU10100AA01000019        U         10100          01        0
##  7 TUU10100AA01000025        U         10100          01        0
##  8 TUU10100AA01000035        U         10100          01        0
##  9 TUU10100AA01000036        U         10100          01        0
## 10 TUU10100AA01000037        U         10100          01        0
## # ... with 85,267 more rows, and 38 more variables: region_code <chr>,
## #   lfstat_code <chr>, educ_code <chr>, maritlstat_code <chr>,
## #   age_code <chr>, orig_code <chr>, race_code <chr>, mjcow_code <chr>,
## #   nmet_code <int>, where_code <chr>, sjmj_code <int>,
## #   timeday_code <chr>, actcode_code <chr>, industry_code <chr>,
## #   occ_code <chr>, prhhchild_code <chr>, earn_code <chr>,
## #   disability_code <chr>, who_code <chr>, hhnscc03_code <chr>,
## #   schenr_code <int>, prownhhchild_code <chr>, work_code <int>,
## #   elnum_code <chr>, ecage_code <chr>, elfreq_code <int>,
## #   eldur_code <chr>, elwho_code <chr>, ecytd_code <int>,
## #   elder_code <int>, lfstatw_code <chr>, pertype_code <chr>,
## #   series_title <chr>, footnote_codes <chr>, begin_year <int>,
## #   begin_period <chr>, end_year <int>, end_period <chr>

There is a lot here to process. The columns we care most about for now are series_id and series_title. Using select() from the dplyr library, we can show just the columns we care about.

library(dplyr)
series_defn %>%
  select(series_id, series_title)

## # A tibble: 85,277 x 2
##             series_id
##                 <chr>
##  1 TUU10100AA01000007
##  2 TUU10100AA01000013
##  3 TUU10100AA01000014
##  4 TUU10100AA01000015
##  5 TUU10100AA01000018
##  6 TUU10100AA01000019
##  7 TUU10100AA01000025
##  8 TUU10100AA01000035
##  9 TUU10100AA01000036
## 10 TUU10100AA01000037
## # ... with 85,267 more rows, and 1 more variables: series_title <chr>

8.1.2 Pairing down the list of variables

Let’s look for variables on sleep, work, and leisure:

series_defn %>%
  select(series_title) %>%
  filter(grepl("sleep", series_title, ignore.case = TRUE))

## # A tibble: 1,310 x 1
##                                                                   series_title
##                                                                          <chr>
##  1                                                  Avg hrs per day - Sleeping
##  2                       Avg hrs per day - Sleeping, Weekend days and holidays
##  3                             Avg hrs per day - Sleeping, Nonholiday weekdays
##  4                                        Avg hrs per day - Sleeping, Employed
##  5             Avg hrs per day - Sleeping, Weekend days and holidays, Employed
##  6                   Avg hrs per day - Sleeping, Nonholiday weekdays, Employed
##  7                        Avg hrs per day - Sleeping, Employed, on days worked
##  8 Avg hrs per day - Sleeping, Weekend days and holidays, Employed, on days wo
##  9   Avg hrs per day - Sleeping, Nonholiday weekdays, Employed, on days worked
## 10                              Avg hrs per day - Sleeping, Employed full time
## # ... with 1,300 more rows

Since this simple search returns a ton of results, let’s further filter by ‘employed’ and ‘per day’:

series_defn %>%
  select(series_title) %>%
  filter(grepl("per day.*sleep.*employed", series_title, ignore.case = TRUE))

## # A tibble: 154 x 1
##                                                                   series_title
##                                                                          <chr>
##  1                                        Avg hrs per day - Sleeping, Employed
##  2             Avg hrs per day - Sleeping, Weekend days and holidays, Employed
##  3                   Avg hrs per day - Sleeping, Nonholiday weekdays, Employed
##  4                        Avg hrs per day - Sleeping, Employed, on days worked
##  5 Avg hrs per day - Sleeping, Weekend days and holidays, Employed, on days wo
##  6   Avg hrs per day - Sleeping, Nonholiday weekdays, Employed, on days worked
##  7                              Avg hrs per day - Sleeping, Employed full time
##  8   Avg hrs per day - Sleeping, Weekend days and holidays, Employed full time
##  9         Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time
## 10              Avg hrs per day - Sleeping, Employed full time, on days worked
## # ... with 144 more rows

Now let’s filter further by ‘employed full time’, ‘nonholiday weekdays’, and ‘on days worked’:

series_defn %>%
  select(series_title) %>%
  filter(grepl("per day.*sleep.*nonholiday weekdays.*employed full time.*on days worked", series_title, ignore.case = TRUE))

## # A tibble: 6 x 1
##                                                                  series_title
##                                                                         <chr>
## 1 Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time, on day
## 2 Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time, on day
## 3 Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time, on day
## 4 Avg hrs per day for participants - Sleeping, Nonholiday weekdays, Employed 
## 5 Avg hrs per day for participants - Sleeping, Nonholiday weekdays, Employed 
## 6 Avg hrs per day for participants - Sleeping, Nonholiday weekdays, Employed

Finally, let’s filter that to exclude the ‘participants only’ group and only get the Men/Women values (not the combined totals):

series_defn %>%
  select(series_title) %>%
  filter(grepl("per day -.*sleep.*nonholiday weekdays.*employed full time.*on days worked,", series_title, ignore.case = TRUE))

## # A tibble: 2 x 1
##                                                                  series_title
##                                                                         <chr>
## 1 Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time, on day
## 2 Avg hrs per day - Sleeping, Nonholiday weekdays, Employed full time, on day

8.1.3 Adding more activity categories

Now let’s add ‘work’ and ‘leisure’ to our search:

activity <- series_defn %>%
  select(series_id, series_title) %>%
  filter(grepl("per day -.*(sleep|work|leisure).*nonholiday weekdays.*employed full time.*on days worked,", series_title, ignore.case = TRUE))
activity

## # A tibble: 26 x 2
##             series_id
##                 <chr>
##  1 TUU10101AA01000344
##  2 TUU10101AA01000423
##  3 TUU10101AA01000962
##  4 TUU10101AA01001041
##  5 TUU10101AA01003012
##  6 TUU10101AA01003097
##  7 TUU10101AA01003307
##  8 TUU10101AA01003378
##  9 TUU10101AA01003947
## 10 TUU10101AA01004011
## # ... with 16 more rows, and 1 more variables: series_title <chr>

Now we should create a variable that codes each of these as either work, sleep, or leisure:

activity <- activity %>%
  mutate(
    activity_type = case_when(
      grepl("leisure", activity$series_title, ignore.case = TRUE) ~ "Leisure",
      grepl("sleep", activity$series_title, ignore.case = TRUE) ~ "Sleep",
      TRUE ~ "Work"
    ), 
    sex = ifelse(grepl("Men", series_title), "Men", "Women")
  )
activity

## # A tibble: 26 x 4
##             series_id
##                 <chr>
##  1 TUU10101AA01000344
##  2 TUU10101AA01000423
##  3 TUU10101AA01000962
##  4 TUU10101AA01001041
##  5 TUU10101AA01003012
##  6 TUU10101AA01003097
##  7 TUU10101AA01003307
##  8 TUU10101AA01003378
##  9 TUU10101AA01003947
## 10 TUU10101AA01004011
## # ... with 16 more rows, and 3 more variables: series_title <chr>,
## #   activity_type <chr>, sex <chr>

Now we can join the activity data.frame with the current data and create time series of each activity type we created.

data <- read_tsv("data/tu/tu.data.0.Current")
data <- data %>%
  inner_join(activity) %>%
  group_by(year, sex, activity_type) %>%
  summarize(hours = sum(as.numeric(value), na.rm = TRUE))
data

## # A tibble: 84 x 4
## # Groups:   year, sex [?]
##     year   sex activity_type hours
##    <int> <chr>         <chr> <dbl>
##  1  2003   Men       Leisure  8.49
##  2  2003   Men         Sleep  7.46
##  3  2003   Men          Work 19.26
##  4  2003 Women       Leisure  7.02
##  5  2003 Women         Sleep  7.65
##  6  2003 Women          Work 17.87
##  7  2004   Men       Leisure  8.66
##  8  2004   Men         Sleep  7.49
##  9  2004   Men          Work 18.92
## 10  2004 Women       Leisure  7.34
## # ... with 74 more rows

8.2 geom_area

geom_area is useful when components that naturally add to each other:

library(ggplot2)
ggplot(data, aes(year, hours, fill= activity_type)) + geom_area() + facet_wrap(~ sex)

8.3 geom_ribbon

data %>%
  ggplot(aes(x = year, group = sex, fill = activity_type)) + 
  geom_ribbon(mapping = aes(ymin = -hours * (sex == "Women"), ymax = hours * (sex == "Men")), data = . %>% filter(activity_type == "Work"), alpha = 0.5) +
  geom_ribbon(mapping = aes(ymin = -hours * (sex == "Women"), ymax = hours * (sex == "Men")), data = . %>% filter(activity_type == "Leisure"), alpha = 0.5) +
  geom_ribbon(mapping = aes(ymin = -hours * (sex == "Women"), ymax = hours * (sex == "Men")), data = . %>% filter(activity_type == "Sleep"), alpha = 0.5) +
  scale_y_continuous(
    name = "Average hours per work day (Fully Employed)",
    breaks = c(-20, -10, 0, 10, 20),
    labels = c("Women 20 hrs", "10 hrs", "0 hrs", "10 hrs", "Men 20 hrs"),
    limits = c(-20, 20)
    )

8.4 Assignment

Plot leisure computer use over time using separate lines for men and women. The y axis should display the amount of use in minutes. The plot should look like the following image (the aspect ratio can be different).