Lecture 3 Facets, Bubbles, and Transparency

3.1 Data

For this session, we’ll explore the Hawaii Tourism Authority (HTA) Air Seat Projection. I’ll be working with the Air Seat Projection for 2017 (revised 06/17). Feel free to download the latest available.

3.1.1 Importing non-standard Excel files

The first steps in preparing a non-standard Excel file are (1) identify how many rows to skip and (2) provide column names if the column names are not neatly contained in a single row. You may also want to set the range if there is metadata at the end of the table you are importing. range overrides any skip setting, so we wont have to specify the number of rows to skip.

library(tidyverse)
library(readxl)

seats <- read_excel("data/2017 Air Seat Forecast rev 0617.xls", col_names = c(
  "dep_city", 
  "seats2017Q1", "seats2017Q2", "seats2017Q3", "seats2017Q4", "seats2017", 
  "seats2016Q1", "seats2016Q2", "seats2016Q3", "seats2016Q4", "seats2016",
  "seatschangeQ1", "seatschangeQ2", "seatschangeQ3", "seatschangeQ4", "seatschange"
), range = "A5:P78")
seats

## # A tibble: 74 x 16
##     dep_city seats2017Q1 seats2017Q2 seats2017Q3 seats2017Q4 seats2017
##        <chr>       <dbl>       <dbl>       <dbl>       <dbl>     <dbl>
##  1     TOTAL     2987920     3016376     3168233     3050112  12222641
##  2 SCHEDULED     2966915     2996155     3140998     3029794  12133862
##  3  CHARTERS       21005       20221       27235       20318     88779
##  4      <NA>          NA          NA          NA          NA        NA
##  5  US TOTAL     1996549     2108969     2215424     2071513   8392455
##  6 SCHEDULED     1978616     2091981     2200195     2055171   8325963
##  7  CHARTERS       17933       16988       15229       16342     66492
##  8      <NA>          NA          NA          NA          NA        NA
##  9   US WEST     1717254     1837080     1943653     1817441   7315428
## 10 Anchorage       25758       15105       13674       17013     71550
## # ... with 64 more rows, and 10 more variables: seats2016Q1 <dbl>,
## #   seats2016Q2 <dbl>, seats2016Q3 <dbl>, seats2016Q4 <dbl>,
## #   seats2016 <dbl>, seatschangeQ1 <dbl>, seatschangeQ2 <chr>,
## #   seatschangeQ3 <chr>, seatschangeQ4 <dbl>, seatschange <dbl>

Let’s add a region identifier

us_west_range <- 10:23
us_east_range <- 26:33
japan_range <- 40:45
canada_range <- 48:52
other_asia_range <-55:58
oceania_range <- 61:64
other_range <- 67:74

seats$region <- NA
seats[us_west_range,]$region <- "US West"
seats[us_east_range,]$region <- "US East"
seats[japan_range,]$region <- "Japan"
seats[canada_range,]$region <- "Canada"
seats[other_asia_range,]$region <- "Other Asia"
seats[oceania_range,]$region <- "Oceania"
seats[other_range,]$region <- "Other"

seats <- seats %>%
  filter(!is.na(region))
seats

## # A tibble: 49 x 17
##          dep_city seats2017Q1 seats2017Q2 seats2017Q3 seats2017Q4
##             <chr>       <dbl>       <dbl>       <dbl>       <dbl>
##  1      Anchorage       25758       15105       13674       17013
##  2     Bellingham       10198         318          NA        6519
##  3         Denver       55803       51654       52585       43290
##  4      Las Vegas       70514       74322       75839       75415
##  5    Los Angeles      548935      647498      715338      647703
##  6        Oakland       84571      104810      116015       90703
##  7        Phoenix      113046      115125      125348      108863
##  8       Portland       90207       71068       65997       81673
##  9     Sacramento       37620       38318       38456       38456
## 10 Salt Lake City       26370       23751       22968       28322
## # ... with 39 more rows, and 12 more variables: seats2017 <dbl>,
## #   seats2016Q1 <dbl>, seats2016Q2 <dbl>, seats2016Q3 <dbl>,
## #   seats2016Q4 <dbl>, seats2016 <dbl>, seatschangeQ1 <dbl>,
## #   seatschangeQ2 <chr>, seatschangeQ3 <chr>, seatschangeQ4 <dbl>,
## #   seatschange <dbl>, region <chr>

3.2 Facets

Let’s do a simple plot comparing 2017 seats outlook to the 2016 seats outlook.

seats %>%
  ggplot(aes(seats2016, seats2017)) +
  geom_point()

The distribution of this data looks like a good candidate for using the log scale (high concentration in lower values and lower concentration in higher values).

seats %>%
  ggplot(aes(seats2016, seats2017)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() + 
  geom_abline(lty = 2) # dashed line type (lty)

Since we have region identifiers it would be nice to divide our data and see charts of each region side-by-side. Facets allow us to make multiple charts based on a variable or set of variables.

seats %>%
  ggplot(aes(seats2016, seats2017)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() + 
  geom_abline(lty = 2) +
  facet_wrap(~ region) +
  coord_fixed()

An alternative representation is to present each region using color:

seats %>%
  ggplot(aes(seats2016, seats2017, color = region, label = dep_city)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() + 
  geom_abline(lty = 2) +
  geom_text(check_overlap = TRUE, nudge_y = 0.1)

3.3 Bubbles

Bubble charts are scatter plots (geom_point) with points that vary in size corresponding to the value of a given variable. Let’s create a measure of the size of a city’s seats relative to its regional total.

seats <- seats %>%
  group_by(region) %>%
  mutate(proportion_of_region = seats2017/sum(seats2017))
seats

## # A tibble: 49 x 18
## # Groups:   region [7]
##          dep_city seats2017Q1 seats2017Q2 seats2017Q3 seats2017Q4
##             <chr>       <dbl>       <dbl>       <dbl>       <dbl>
##  1      Anchorage       25758       15105       13674       17013
##  2     Bellingham       10198         318          NA        6519
##  3         Denver       55803       51654       52585       43290
##  4      Las Vegas       70514       74322       75839       75415
##  5    Los Angeles      548935      647498      715338      647703
##  6        Oakland       84571      104810      116015       90703
##  7        Phoenix      113046      115125      125348      108863
##  8       Portland       90207       71068       65997       81673
##  9     Sacramento       37620       38318       38456       38456
## 10 Salt Lake City       26370       23751       22968       28322
## # ... with 39 more rows, and 13 more variables: seats2017 <dbl>,
## #   seats2016Q1 <dbl>, seats2016Q2 <dbl>, seats2016Q3 <dbl>,
## #   seats2016Q4 <dbl>, seats2016 <dbl>, seatschangeQ1 <dbl>,
## #   seatschangeQ2 <chr>, seatschangeQ3 <chr>, seatschangeQ4 <dbl>,
## #   seatschange <dbl>, region <chr>, proportion_of_region <dbl>

Now we can modify the chart to show the importance of each city in the context of its region.

seats %>%
  filter(region %in% c("US West", "US East")) %>%
  ggplot(aes(seats2016, seats2017, color = region, label = dep_city)) +
  geom_abline(lty = 2) +
  geom_point(aes(size = proportion_of_region)) +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) + 
  geom_text(check_overlap = TRUE, nudge_y = 0.1)

3.4 Transparency

We can also use transparency (or alpha) to make less important points less visible. We do this by setting the alpha aesthetic. Let’s try adding the alpha setting to the geom_point() call first.

seats %>%
  filter(region %in% c("US West", "US East")) %>%
  ggplot(aes(seats2016, seats2017, color = region, label = dep_city)) +
  geom_abline(lty = 2) +
  geom_point(aes(size = proportion_of_region, alpha = proportion_of_region)) +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) + 
  geom_text(check_overlap = TRUE, nudge_y = 0.1)

Let’s add the alpha to the ggplot-level aesthetic instead, so that it also affects the text labels.

seats %>%
  filter(region %in% c("US West", "US East")) %>%
  ggplot(aes(seats2016, seats2017, color = region, label = dep_city, alpha = proportion_of_region)) +
  geom_abline(lty = 2) +
  geom_point(aes(size = proportion_of_region)) +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) + 
  geom_text(nudge_y = 0.1)

We can combine all the regions now and use transparency to help us see how many cities are in the same area on the plot by how dark a region is.

seats %>%
  ggplot(aes(seats2016, seats2017, label = dep_city, alpha = proportion_of_region)) +
  geom_abline(lty = 2) +
  geom_point(aes(size = proportion_of_region)) +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) + 
  geom_text(aes(color = region), hjust = "right", vjust = "center")

3.5 Facets

Let’s use facets so we can combine everything we’ve done so far.

seats %>%
  ggplot(aes(seats2016, seats2017, label = dep_city, alpha = proportion_of_region)) +
  geom_abline(lty = 2) +
  geom_point(aes(size = proportion_of_region), color = "darkblue") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) + 
  geom_text(hjust = "right", vjust = "center", nudge_x = -0.3) +
  facet_wrap(~ region)

3.6 Assignment

Create a bubble plot highlighting the change in year-on-year growth rates for different quarters. Plot seatschangeQ3 on the x axis and seatschangeQ4 on the y axis. Use seats2017 to determine the size of each bubble. Facet by region.