Anscombe’s Quartet

Anscombe’s quartet emphasizes the need to move beyond basic numerical summaries of your data. The anscombe dataset has four sets of x and y variables with very similar summaries but distinct visual patterns.

Prep the data

anscombe

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

First we’ll use tidyr to reshape the anscombe dataset to make it easier to work with. We want a column, id, to identify each point, a column for the series (x1 is the x value in series 1), and columns for x and y. In the anscombe dataset, each row pairs the x and y values within a series, but rows carry no meaning across series.

library(tidyverse)
tidy_anscombe <- anscombe %>%
  mutate(id = row_number()) %>%          # label each point within its series
  gather(key = key, value = value, -id)  # stack every column except id
tidy_anscombe %>% as_tibble()

## # A tibble: 88 x 3
##       id   key value
##    <int> <chr> <dbl>
##  1     1    x1    10
##  2     2    x1     8
##  3     3    x1    13
##  4     4    x1     9
##  5     5    x1    11
##  6     6    x1    14
##  7     7    x1     6
##  8     8    x1     4
##  9     9    x1    12
## 10    10    x1     7
## # ... with 78 more rows

Now we can split the key column into an x_or_y column and a series column.

tidy_anscombe <- tidy_anscombe %>%
  separate(key, c("x_or_y", "series"), sep = 1)  # split after the first character
tidy_anscombe %>% as_tibble()

## # A tibble: 88 x 4
##       id x_or_y series value
##  * <int>  <chr>  <chr> <dbl>
##  1     1      x      1    10
##  2     2      x      1     8
##  3     3      x      1    13
##  4     4      x      1     9
##  5     5      x      1    11
##  6     6      x      1    14
##  7     7      x      1     6
##  8     8      x      1     4
##  9     9      x      1    12
## 10    10      x      1     7
## # ... with 78 more rows
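To see what the positional split does on its own, here is a minimal sketch on a toy data frame. The `toy` table and the two-digit key "x10" are made up for illustration and are not part of anscombe; they show that splitting after the first character does not depend on how many digits the series number has.

```r
library(tidyr)

# Toy keys; "x10" is a hypothetical two-digit series, not part of anscombe
toy <- data.frame(key = c("x1", "y4", "x10"), stringsAsFactors = FALSE)

# A numeric sep is a position: sep = 1 splits after the first character
toy_split <- separate(toy, key, into = c("x_or_y", "series"), sep = 1)
toy_split
```

Note that the series column comes back as character, which matches the <chr> type shown in the output above.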

Now we can use spread() to create the final form of our table, regrouping the associated x and y values. We could have done something simpler since we knew there were only 4 series, but the code we used will work for an arbitrary number of series.

tidy_anscombe <- tidy_anscombe %>%
  spread(x_or_y, value)
tidy_anscombe %>% as_tibble()

## # A tibble: 44 x 4
##       id series     x     y
##  * <int>  <chr> <dbl> <dbl>
##  1     1      1    10  8.04
##  2     1      2    10  9.14
##  3     1      3    10  7.46
##  4     1      4     8  6.58
##  5     2      1     8  6.95
##  6     2      2     8  8.14
##  7     2      3     8  6.77
##  8     2      4     8  5.76
##  9     3      1    13  7.58
## 10     3      2    13  8.74
## # ... with 34 more rows
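For comparison, the same reshaping can be sketched in a single step with pivot_longer(), assuming a tidyr version that provides it (1.0.0 or later). The name `tidy_anscombe_alt` is only used here for illustration, and the regular expression assumes every column name is one letter followed by the series number.

```r
library(dplyr)
library(tidyr)

tidy_anscombe_alt <- anscombe %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    -id,
    # ".value" turns the first captured piece (x or y) into output columns;
    # the second piece becomes the series column
    names_to = c(".value", "series"),
    names_pattern = "(.)(.+)"
  )
tidy_anscombe_alt
```

This should give the same id, series, x, and y columns as the gather(), separate(), and spread() sequence, without the intermediate key column.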

Numeric summary

tidy_anscombe %>%
  group_by(series) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    cor = cor(x, y)
  )

## # A tibble: 4 x 6
##   series mean_x   mean_y     sd_x     sd_y       cor
##    <chr>  <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
## 1      1      9 7.500909 3.316625 2.031568 0.8164205
## 2      2      9 7.500909 3.316625 2.031657 0.8162365
## 3      3      9 7.500000 3.316625 2.030424 0.8162867
## 4      4      9 7.500909 3.316625 2.030579 0.8165214
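The similarity extends beyond means, standard deviations, and correlations. As a further check, this sketch fits a simple linear regression to each series using base R on the original anscombe data frame; the fitted lines should all come out close to y = 3 + 0.5x. The name `fits` is just a label for this example.

```r
# Fit y ~ x separately for each of the four series in the built-in anscombe data
fits <- t(sapply(1:4, function(i) {
  series <- data.frame(
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
  coef(lm(y ~ x, data = series))
}))
rownames(fits) <- paste("series", 1:4)

# One row per series: intercept and slope, rounded for comparison
round(fits, 2)
```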

Visual summary

While the numeric summaries suggest very similar datasets, the visual summaries help identify the differences:

library(ggplot2)
tidy_anscombe %>%
  ggplot(aes(x, y)) +
  geom_point() +
  facet_wrap(~ series) +
  coord_fixed()
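To make the contrast more explicit, one option is to overlay a least-squares line on each panel. The sketch below rebuilds the long table in base R so that it runs on its own; `anscombe_long` and `p` are names chosen for this example, and the geom_smooth() layer is an addition to the plot above rather than part of it.

```r
library(ggplot2)

# Rebuild the long table directly, one block of rows per series
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    series = as.character(i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

p <- ggplot(anscombe_long, aes(x, y)) +
  geom_point() +
  # Linear fit per panel, drawn without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ series) +
  coord_fixed()
p
```

The fitted line looks nearly the same in every panel even though only some of the series are reasonably described by a straight line.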

The Datasaurus Dozen

The Datasaurus Dozen is a set of series that, like Anscombe’s quartet, have similar numerical summaries and radically different visual summaries. Its creators, Justin Matejka and George Fitzmaurice, have published a great discussion of this dataset.

Download the data and move the DatasaurusDozen.tsv file into your data folder.

datasaurus <- read_tsv("data/DatasaurusDozen.tsv")
datasaurus %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    cor = cor(x, y)
  )

## # A tibble: 13 x 6
##       dataset   mean_x   mean_y     sd_x     sd_y         cor
##         <chr>    <dbl>    <dbl>    <dbl>    <dbl>       <dbl>
##  1       away 54.26610 47.83472 16.76982 26.93974 -0.06412835
##  2   bullseye 54.26873 47.83082 16.76924 26.93573 -0.06858639
##  3     circle 54.26732 47.83772 16.76001 26.93004 -0.06834336
##  4       dino 54.26327 47.83225 16.76514 26.93540 -0.06447185
##  5       dots 54.26030 47.83983 16.76774 26.93019 -0.06034144
##  6    h_lines 54.26144 47.83025 16.76590 26.93988 -0.06171484
##  7 high_lines 54.26881 47.83545 16.76670 26.94000 -0.06850422
##  8 slant_down 54.26785 47.83590 16.76676 26.93610 -0.06897974
##  9   slant_up 54.26588 47.83150 16.76885 26.93861 -0.06860921
## 10       star 54.26734 47.83955 16.76896 26.93027 -0.06296110
## 11    v_lines 54.26993 47.83699 16.76996 26.93768 -0.06944557
## 12 wide_lines 54.26692 47.83160 16.77000 26.93790 -0.06657523
## 13    x_shape 54.26015 47.83972 16.76996 26.93000 -0.06558334
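One way to put a number on how close these summaries are is to take the range (maximum minus minimum) of each statistic across the groups. The helper below, `summary_spread`, is a hypothetical name introduced for this sketch; it assumes a data frame with x and y columns plus a grouping column, and a dplyr version that supports `{{ }}` and across(). It is demonstrated on the built-in anscombe data so that it runs without the download.

```r
library(dplyr)

# Hypothetical helper: range (max - min) of each summary statistic across groups
summary_spread <- function(df, group) {
  df %>%
    group_by({{ group }}) %>%
    summarise(
      mean_x = mean(x), mean_y = mean(y),
      sd_x = sd(x), sd_y = sd(y),
      cor = cor(x, y)
    ) %>%
    select(-{{ group }}) %>%
    summarise(across(everything(), ~ max(.x) - min(.x)))
}

# Demonstrated on the built-in anscombe data, reshaped to one row per point
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    series = as.character(i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))
spread_tbl <- summary_spread(anscombe_long, series)
spread_tbl
```

With the TSV loaded as above, `summary_spread(datasaurus, dataset)` would give the same comparison for the thirteen Datasaurus series.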

Visual summaries

datasaurus %>%
  ggplot(aes(x, y)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 6) +
  coord_fixed()