Lecture 9 Jitter, Rug, and Aesthetics
9.1 Data
The Panel Study of Income Dynamics (PSID) is the longest running longitudinal household survey in the world.
From the Data page, you can use the Data Center
to create customized datasets. We’ll use the Packaged Data option. Click the Main and Supplemental Studies link. Under the Supplemental Studies > Transition into Adulthood Supplement
section, select the download for 2015
.
To download the supplement you will need to sign in or register for a new account (by clicking the “New User?” link). Once you have logged in you should be able to download the zip file:
- ta2015.zip
9.1.1 Codebook
The TA2015_codebook.pdf
is the perfect place for us to identify some key variables of interest. The following is an excerpt listing the variables we will use:
TA150003 "2015 PSID FAMILY IW (ID) NUMBER"
2015 PSID Family Interview Number
TA150004 "2015 INDIVIDUAL SEQUENCE NUMBER"
2015 PSID Sequence Number
This variable provides a means of identifying an individual's status with regard to the
PSID family unit at the time of the 2015 PSID interview.
Value/Range Code Value/Range Text
1 - 20 Individuals in the family at the time of the 2015 PSID
interview
51 - 59 Individuals in institutions at the time of the 2015 PSID
interview
TA150005 "CURRENT STATE"
Current State (FIPS state codes)
TA150015 "A1_1 HOW SATISFIED W/ LIFE AS A WHOLE"
A1_1. We'd like to start by asking you about life in general. Please think about your
life-as-a-whole. How satisfied are you with it? Are you completely satisfied, very
satisfied, somewhat satisfied, not very satisfied, or not at all satisfied?
Value/Range Code Value/Range Text
1 Completely satisfied
2 Very satisfied
3 Somewhat satisfied
4 Not very satisfied
5 Not at all satisfied
8 DK
TA150092 "D28A NUMBER OF CHILDREN"
D28a. How many (biological,) adopted, or step-children do you have?
TA150128 "E1 EMPLOYMENT STATUS 1ST MENTION"
E1. Now we have some questions about employment. We would like to know about what you do -
- are you working now, looking for work, keeping house, a student, or what?--1ST MENTION
Value/Range Code Value/Range Text
1 Working now, including military
2 Only temporarily laid off; sick or maternity leave
3 Looking for work, unemployed
5 Disabled, permanently or temporarily
6 Keeping house
7 Student
TA150512 "F1 HOW MUCH EARN LAST YEAR"
F1. We try to understand how people all over the country are getting along financially, so
now I have some questions about earnings and income. How much did you earn altogether
from work in 2014, that is, before anything was deducted for taxes or other things,
including any income from bonuses, overtime, tips, commissions, military pay or any other
source?
Value/Range Code Value/Range Text
0 - 5,000,000 Actual amount
9,999,998 DK
9,999,999 NA; refused
9.1.1.1 FIPS
In preparation for working with these variables, we can setup arrays to take the place of the codebook. The tigris
package will give us the FIPS codes for each state:
install.packages(tigris)
library(tidyverse)
state_fips <- tigris::fips_codes %>%
group_by(state) %>%
summarize(fips = as.numeric(first(state_code)))
fips2state <- array()
fips2state[state_fips$fips] <- state_fips$state
fips2state
## [1] "AL" "AK" NA "AZ" "AR" "CA" NA "CO" "CT" "DE" "DC" "FL" "GA" NA
## [15] "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS"
## [29] "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA"
## [43] NA "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" NA "WA" "WV" "WI" "WY"
## [57] NA NA NA "AS" NA NA NA NA NA "GU" NA NA "MP" NA
## [71] NA "PR" NA "UM" NA NA NA "VI"
9.1.1.2 Satisfaction
satisfaction <- array()
satisfaction_levels <- c("Completely satisfied", "Very satisfied", "Somewhat satisfied", "Not very satisfied", "Not at all satisfied", "DK")
satisfaction[c(1, 2, 3, 4, 5, 8)] <- satisfaction_levels
satisfaction
## [1] "Completely satisfied" "Very satisfied" "Somewhat satisfied"
## [4] "Not very satisfied" "Not at all satisfied" NA
## [7] NA "DK"
9.1.1.3 Employment Status
We can also specify the array elements one by one:
employment <- array()
employment[1] <- "Working now, including military"
employment[2] <- "Only temporarily laid off; sick or maternity leave"
employment[3] <- "Looking for work, unemployed"
employment[5] <- "Disabled, permanently or temporarily"
employment[6] <- "Keeping house"
employment[7] <- "Student"
employment
## [1] "Working now, including military"
## [2] "Only temporarily laid off; sick or maternity leave"
## [3] "Looking for work, unemployed"
## [4] NA
## [5] "Disabled, permanently or temporarily"
## [6] "Keeping house"
## [7] "Student"
9.1.2 Preprocessing the SPSS File
If you don’t have them already, open the zip file and move the TA2015.txt
and TA2015.sps
files into the data
folder. For our import to work, We need to find a line that needs to be removed from the top of the sps file. The line we want to remove should look like the following line:
FILE HANDLE PSID / NAME = '[PATH]\TA2015.TXT' LRECL = 2173 .
To find this line we can output the first 20 lines of the TA2015.sps
file:
readLines("data/TA2015.sps", n = 10)
## [1] ""
## [2] "**************************************************************************"
## [3] " Label : Transition to Adulthood Study 2015"
## [4] " Rows : 1641"
## [5] " Columns : 1304"
## [6] " ASCII File Date : July 5, 2017"
## [7] "*************************************************************************."
## [8] ""
## [9] "FILE HANDLE PSID / NAME = '[PATH]\\TA2015.TXT' LRECL = 2173 ."
## [10] "DATA LIST FILE = PSID FIXED /"
Now we know the line to remove is line number 9, we can write a new file to be used in the processing step below.
input <- file("data/TA2015.sps")
output <- file("data/TA2015_clean.sps")
open(input, type = "r")
open(output, open = "w")
writeLines(readLines(input, n = 8), output)
invisible(readLines(input, n = 1))
writeLines(readLines(input), output)
close(input)
close(output)
readLines("data/TA2015_clean.sps", n = 10)
## [1] ""
## [2] "**************************************************************************"
## [3] " Label : Transition to Adulthood Study 2015"
## [4] " Rows : 1641"
## [5] " Columns : 1304"
## [6] " ASCII File Date : July 5, 2017"
## [7] "*************************************************************************."
## [8] ""
## [9] "DATA LIST FILE = PSID FIXED /"
## [10] " TA150001 1 - 1 TA150002 2 - 6 TA150003 7 - 11 "
9.1.3 Importing with the SPSS file using memisc
The memisc
package has useful tools for importing SPSS and Stata files that augment what already exists in base
. Unfortunately, one of its dependencies, MASS
, will mask the select method from dplyr
. To avoid this, instead of loading memisc
with library(memisc)
, we can prefix all memisc
functions with memisc::
.
install.packages("memisc")
ta_importer <- memisc::spss.fixed.file("data/TA2015.txt", columns.file = "data/TA2015_clean.sps", varlab.file = "data/TA2015_clean.sps", to.lower = FALSE)
ta_full <- memisc::as.data.set(ta_importer)
ta_full
##
## Data set with 1641 observations and 1304 variables
##
## TA150001 TA150002 TA150003 TA150004 TA150005 TA150006 TA150007 ...
## 1 1 1 4893 1 37 55 9 ...
## 2 1 2 2967 1 48 57 9 ...
## 3 1 3 6095 5 37 96 9 ...
## 4 1 4 3738 3 8 98 9 ...
## 5 1 5 6741 3 16 104 9 ...
## 6 1 6 4839 1 28 68 9 ...
## 7 1 7 3828 1 48 67 9 ...
## 8 1 8 5640 51 21 88 9 ...
## 9 1 9 5210 52 6 80 9 ...
## 10 1 10 339 3 18 93 9 ...
## 11 1 11 3192 2 26 66 9 ...
## 12 1 12 561 2 26 91 9 ...
## 13 1 13 2500 51 13 100 9 ...
## 14 1 14 5283 2 39 68 9 ...
## 15 1 15 6679 1 24 62 9 ...
## 16 1 16 3286 2 8 67 9 ...
## 17 1 17 3266 2 6 155 9 ...
## 18 1 18 3720 2 48 61 9 ...
## 19 1 19 3714 2 48 82 9 ...
## 20 1 20 6244 1 48 89 9 ...
## 21 1 21 4199 1 13 80 9 ...
## 22 1 22 3962 2 51 86 9 ...
## 23 1 23 2878 1 41 85 9 ...
## 24 1 24 3835 1 37 85 9 ...
## 25 1 25 487 51 13 91 9 ...
## .. ........ ........ ........ ........ ........ ........ ........ ...
## (25 of 1641 observations shown)
9.1.4 Transform Data
Take the arrays we created above to process the file down to the variables we selected.
ta <- ta_full %>%
as.data.frame() %>%
as.tbl() %>%
filter(TA150005 > 0) %>% # get rid of the 1 non-US response
transmute(
family_id = TA150003,
in_institution = TA150004 > 50,
state = fips2state[TA150005],
life_satisfaction = factor(satisfaction[TA150015], levels = satisfaction_levels, ordered = TRUE),
children = TA150092,
employment_status = employment[TA150128],
annual_earnings = TA150512 %>% na_if(9999999) %>% na_if(9999999)
)
ta
## # A tibble: 1,640 x 7
## family_id in_institution state life_satisfaction children
## <dbl> <lgl> <chr> <ord> <dbl>
## 1 4893 FALSE NC Somewhat satisfied 0
## 2 2967 FALSE TX Somewhat satisfied 0
## 3 6095 FALSE NC Somewhat satisfied 0
## 4 3738 FALSE CO Somewhat satisfied 0
## 5 6741 FALSE ID Very satisfied 0
## 6 4839 FALSE MS Completely satisfied 1
## 7 3828 FALSE TX Somewhat satisfied 0
## 8 5640 TRUE KY Completely satisfied 0
## 9 5210 TRUE CA Very satisfied 0
## 10 339 FALSE IN Somewhat satisfied 0
## # ... with 1,630 more rows, and 2 more variables: employment_status <chr>,
## # annual_earnings <dbl>
library(ggplot2)
base_plot <- ggplot(ta, aes(life_satisfaction, annual_earnings)) + scale_y_log10()
base_plot + geom_point()
9.2 Jitter
When many points overlap, using geom_jitter
adjusts the position of each point to minimize overlap.
base_plot + geom_jitter()
See how this compares to using alpha
(i.e., opacity) to see how many points are in a given position:
base_plot + geom_point(alpha = 0.1)
9.3 Rug
Often when there are many points, we want to plot a summary that presents the general shape of the data.
base_plot + geom_violin()
The geom_rug
gives us a rug plot that we can use to highlight where actual observations occured when we create these summary plots.
base_plot + geom_violin() + geom_rug()
Both alpha
and jitter
can be applied to the rug as well.
base_plot + geom_violin() + geom_rug(alpha = 0.1, position = "jitter")
Now we can see that rug did in fact create a line on the x-axis for each observation. We can use the sides
property to only display the rug on the left and right ("lr"
)
base_plot + geom_violin() + geom_rug(alpha = 0.1, position = "jitter", sides = "lr")
9.4 Aesthetics
Aesthetics are the visual properties that define each graph. In ggplot, the aes()
function is used to create a mapping from your data to these visual properties. We most often use aes()
to map our variables to the x
and y
dimension. Above, we used the fact that the first two arguments to aes()
are x
and y
. That is,
aes(x = life_satisfaction, y = annual_earnings)
gives the same result as
aes(life_satisfaction, annual_earnings)
Anytime we want more than two dimensions of our data displayed, it’s useful to map the extra variables to other features of our graph (e.g., size, color, alpha, etc.). Let’s add number of children to our graph assigning children to color. (In ggplot both American and British spellings are supported, so you can use color
or colour
.)
ggplot(ta, aes(life_satisfaction, annual_earnings, color = children)) +
scale_y_log10() +
geom_jitter()
If the children
variable was a factor, more distinct colors would have been chosen. We can see this by coloring by employment_status
instead.
ggplot(ta, aes(life_satisfaction, annual_earnings, color = employment_status)) +
scale_y_log10() +
geom_jitter()
You can emphasize a variable by encoding in more than one visual aesthetic. Let’s map children
to color
, size
, and alpha
.
ggplot(ta, aes(life_satisfaction, annual_earnings, color = children, size = children, alpha = children)) +
scale_y_log10("Annual Earnings", labels = scales::dollar) +
geom_jitter()
9.5 Assignment
Open the codebook (data/TA2015_codebook.pdf
) and search for two new variables to visualize. Create a plot that makes use of both jitter and rug.