Lecture 6 Boxplots and Violin Plots
Boxplots and violin plots are two important tools for visualizing the distribution of data within a dataset. The boxplot highlights the median, key percentiles, and outliers within a dataset. The violin plot takes a kernel density plot, rotates it 90 degrees, then mirrors it about the axis to create a shape that sometimes resembles a violin.
6.1 Data
The Social Security Administration releases data on earnings and employment each year. We’ll take a look at the data for 2014:
https://www.ssa.gov/policy/docs/statcomps/eedata_sc/2014/index.html
We’re going to download Table 1: “Number of persons with Social Security (OASDI) taxable earnings, amount taxable, and contributions, by state or other area, sex, and type of earnings, 2014”
Save that file as ‘ssa_earnings.xlsx’ in the data
folder
library(tidyverse)
library(readxl)
ssa <- read_xlsx("data/ssa_earnings.xlsx", range = "A7:J159",
col_names = c("state", "gender", "other", "other2", "number.total", "number.wage", "number.self",
"earnings.total", "earnings.wage", "earnings.self"))
ssa
## # A tibble: 153 x 10
## state gender other other2 number.total number.wage number.self
## <chr> <chr> <lgl> <lgl> <dbl> <dbl> <dbl>
## 1 Alabama <NA> NA NA 2355477 2215535 255253
## 2 <NA> Men NA NA 1200468 1116458 138895
## 3 <NA> Women NA NA 1155009 1099077 116357
## 4 Alaska <NA> NA NA 400007 375833 47696
## 5 <NA> Men NA NA 223464 209694 27884
## 6 <NA> Women NA NA 176543 166140 19812
## 7 Arizona <NA> NA NA 3189785 2997567 334292
## 8 <NA> Men NA NA 1660088 1551488 185753
## 9 <NA> Women NA NA 1529697 1446079 148539
## 10 Arkansas <NA> NA NA 1468898 1376249 163320
## # ... with 143 more rows, and 3 more variables: earnings.total <dbl>,
## # earnings.wage <dbl>, earnings.self <dbl>
The starting format is far from ideal. Each row should represent one group, so we don’t need any of the rows with totals.
It’s important to always read any footnotes and documentation that comes with the data you plan to use. Footnote c
for this table indicates that individuals with both wage and salary employment will be counted in both groups, but only once in the total. It is important to be aware of this double counting.
ssa_long <- ssa %>%
fill(state) %>%
filter(!is.na(gender)) %>%
reshape(varying = 5:10, direction = "long", timevar = "earnings_type") %>%
select(state, gender, earnings_type, number, earnings) %>%
mutate(per_capita = earnings / number)
6.2 Boxplots
ssa_long %>%
filter(earnings_type != "total") %>%
ggplot(aes(gender, per_capita)) +
geom_boxplot()
ssa_long %>%
ggplot(aes(gender, per_capita, fill = gender)) +
geom_boxplot() +
facet_grid(~ earnings_type)
6.3 Violin Plots
Let’s repeat the above plots using the violin plot type.
ssa_long %>%
filter(earnings_type != "total") %>%
ggplot(aes(gender, per_capita)) +
geom_violin()
ssa_long %>%
ggplot(aes(gender, per_capita, color = gender, fill = gender)) +
geom_violin() +
facet_grid(~ earnings_type)
6.4 Dot Plots
Dot plots appear similar to violin plots, but dot plots may be easier to interpret:
ssa_long %>%
ggplot(aes(gender, per_capita, color = gender, fill = gender)) +
geom_dotplot(binaxis = "y", stackdir = "center", position = "dodge") +
facet_grid(~ earnings_type)
6.5 Assignment
Create your own visualizations of the distribution of the earnings
and number
variables.