Lecture 14 Log scales
The focus of this lecture is on log scales. The goal is to give you an intuition about when to use log scales and how to interpret them. We will refresh on basic log transformations and see how they affect how we can encode data in a variety of visualizations.
This lecture uses the following packages:
tidyverse
14.1 Basic Log Review
14.1.1 Order of magnitude
Remember that the log of a number is the exponent to which the base must be raised to produce that number.
So that,
\[ \log_{10}(x) = y \]
implies
\[ 10^y = x \]
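As a minimal sketch, this relationship can be checked numerically in R. The vector vals is an arbitrary illustration and is not part of the lecture data.

```r
# Arbitrary example values (chosen for illustration only)
vals <- c(1, 5, 10, 250, 1000)

# Taking the base-10 log gives the exponent y for each value
exponents <- log10(vals)

# Raising 10 to that exponent should recover the original input
recovered <- 10^exponents

# all.equal() compares with a small numerical tolerance
all.equal(recovered, vals)
```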
Let’s use a concrete example where \(x\) is a vector of the integers from 1 to 100.
x <- 1:100
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Let’s store the base 10 log of \(x\) in the vector \(y\):
y <- log(x, base = 10)  # general form: state the base explicitly
y <- log10(x)           # equivalent shorthand for base 10
y
## [1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513 0.8450980
## [8] 0.9030900 0.9542425 1.0000000 1.0413927 1.0791812 1.1139434 1.1461280
## [15] 1.1760913 1.2041200 1.2304489 1.2552725 1.2787536 1.3010300 1.3222193
## [22] 1.3424227 1.3617278 1.3802112 1.3979400 1.4149733 1.4313638 1.4471580
## [29] 1.4623980 1.4771213 1.4913617 1.5051500 1.5185139 1.5314789 1.5440680
## [36] 1.5563025 1.5682017 1.5797836 1.5910646 1.6020600 1.6127839 1.6232493
## [43] 1.6334685 1.6434527 1.6532125 1.6627578 1.6720979 1.6812412 1.6901961
## [50] 1.6989700 1.7075702 1.7160033 1.7242759 1.7323938 1.7403627 1.7481880
## [57] 1.7558749 1.7634280 1.7708520 1.7781513 1.7853298 1.7923917 1.7993405
## [64] 1.8061800 1.8129134 1.8195439 1.8260748 1.8325089 1.8388491 1.8450980
## [71] 1.8512583 1.8573325 1.8633229 1.8692317 1.8750613 1.8808136 1.8864907
## [78] 1.8920946 1.8976271 1.9030900 1.9084850 1.9138139 1.9190781 1.9242793
## [85] 1.9294189 1.9344985 1.9395193 1.9444827 1.9493900 1.9542425 1.9590414
## [92] 1.9637878 1.9684829 1.9731279 1.9777236 1.9822712 1.9867717 1.9912261
## [99] 1.9956352 2.0000000
Notice in particular the following pattern:
log10(c(1, 10, 100, 1000, 10000))
## [1] 0 1 2 3 4
It is useful to think of the logarithm (log) as recording the order of magnitude of the input value.
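One way to make the order-of-magnitude reading concrete is to round the log down to a whole number. The sketch below uses made-up positive integers (vals is a hypothetical name) and compares the result with the number of digits.

```r
# Arbitrary positive integers for illustration
vals <- c(7, 42, 365, 5280, 86400)

# floor(log10(x)) is the largest power of ten that does not exceed x
magnitude <- floor(log10(vals))
magnitude

# For positive integers, the digit count is one more than that power
digits <- nchar(as.character(vals))
all(digits == magnitude + 1)
```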
library(tidyverse)
ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point()
From looking at the scatter plot of \(x\) and \(y\) you should notice that the space between larger values is diminished relative to the space between smaller values.
14.1.2 Percent change
A useful feature of logged variables is that the difference between two logged values depends only on the ratio of the original values, and therefore on the percentage change between them. So, if the difference between two logged values is the same as the difference between two other logged values, the percentage change across both pairs is the same.
log10(110) - log10(100)
## [1] 0.04139269
(110 - 100)/100
## [1] 0.1
log10(220) - log10(200)
## [1] 0.04139269
(220 - 200)/200
## [1] 0.1
This is useful in visualizations because similar gaps in a log variable in different regions of the chart correspond to the same percentage change.
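The reason this works is that a difference of logs is the log of a ratio. A short sketch, reusing the values 100, 110, 200 and 220 from the example above (the variable names are hypothetical), shows how to convert a log difference back into a percentage change.

```r
# Two pairs of values that each differ by 10%
a <- c(100, 200)
b <- c(110, 220)

# The difference of logs equals the log of the ratio b / a
log_diff <- log10(b) - log10(a)
all.equal(log_diff, log10(b / a))

# Undo the log to recover the percentage change as a proportion
pct_change <- 10^log_diff - 1
pct_change
```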
To test this let’s create a variable that grows by a fixed percentage.
constant_growth <- 100 * (1.10) ^ (1:100)
constant_growth
## [1] 110.0000 121.0000 133.1000 146.4100 161.0510
## [6] 177.1561 194.8717 214.3589 235.7948 259.3742
## [11] 285.3117 313.8428 345.2271 379.7498 417.7248
## [16] 459.4973 505.4470 555.9917 611.5909 672.7500
## [21] 740.0250 814.0275 895.4302 984.9733 1083.4706
## [26] 1191.8177 1310.9994 1442.0994 1586.3093 1744.9402
## [31] 1919.4342 2111.3777 2322.5154 2554.7670 2810.2437
## [36] 3091.2681 3400.3949 3740.4343 4114.4778 4525.9256
## [41] 4978.5181 5476.3699 6024.0069 6626.4076 7289.0484
## [46] 8017.9532 8819.7485 9701.7234 10671.8957 11739.0853
## [51] 12912.9938 14204.2932 15624.7225 17187.1948 18905.9142
## [56] 20796.5057 22876.1562 25163.7719 27680.1490 30448.1640
## [61] 33492.9803 36842.2784 40526.5062 44579.1568 49037.0725
## [66] 53940.7798 59334.8578 65268.3435 71795.1779 78974.6957
## [71] 86872.1652 95559.3818 105115.3200 115626.8519 127189.5371
## [76] 139908.4909 153899.3399 169289.2739 186218.2013 204840.0215
## [81] 225324.0236 247856.4260 272642.0686 299906.2754 329896.9030
## [86] 362886.5933 399175.2526 439092.7778 483002.0556 531302.2612
## [91] 584432.4873 642875.7360 707163.3096 777879.6406 855667.6047
## [96] 941234.3651 1035357.8016 1138893.5818 1252782.9400 1378061.2340
qplot(1:100, constant_growth)
qplot(1:100, log(constant_growth))
diff(x) returns a vector of the differences between consecutive values of x:
diff(constant_growth)
## [1] 11.00000 12.10000 13.31000 14.64100 16.10510
## [6] 17.71561 19.48717 21.43589 23.57948 25.93742
## [11] 28.53117 31.38428 34.52271 37.97498 41.77248
## [16] 45.94973 50.54470 55.59917 61.15909 67.27500
## [21] 74.00250 81.40275 89.54302 98.49733 108.34706
## [26] 119.18177 131.09994 144.20994 158.63093 174.49402
## [31] 191.94342 211.13777 232.25154 255.47670 281.02437
## [36] 309.12681 340.03949 374.04343 411.44778 452.59256
## [41] 497.85181 547.63699 602.40069 662.64076 728.90484
## [46] 801.79532 881.97485 970.17234 1067.18957 1173.90853
## [51] 1291.29938 1420.42932 1562.47225 1718.71948 1890.59142
## [56] 2079.65057 2287.61562 2516.37719 2768.01490 3044.81640
## [61] 3349.29803 3684.22784 4052.65062 4457.91568 4903.70725
## [66] 5394.07798 5933.48578 6526.83435 7179.51779 7897.46957
## [71] 8687.21652 9555.93818 10511.53200 11562.68519 12718.95371
## [76] 13990.84909 15389.93399 16928.92739 18621.82013 20484.00215
## [81] 22532.40236 24785.64260 27264.20686 29990.62754 32989.69030
## [86] 36288.65933 39917.52526 43909.27778 48300.20556 53130.22612
## [91] 58443.24873 64287.57360 70716.33096 77787.96406 85566.76047
## [96] 94123.43651 103535.78016 113889.35818 125278.29400
Taking a look at the differences of the logged constant growth variable, we see that the change across values is now constant.
diff(log(constant_growth))
## [1] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [7] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [13] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [19] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [25] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [31] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [37] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [43] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [49] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [55] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [61] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [67] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [73] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [79] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [85] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [91] 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018 0.09531018
## [97] 0.09531018 0.09531018 0.09531018
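The constant step can be converted back into the growth rate that generated the series. The sketch below rebuilds constant_growth as defined earlier; step and growth_rate are hypothetical names introduced for illustration.

```r
# Same series as above: 10% growth per period
constant_growth <- 100 * 1.10^(1:100)

# Each step of the natural log equals log(1.10)
step <- diff(log(constant_growth))

# Exponentiating the step and subtracting 1 recovers the growth rate
growth_rate <- exp(step) - 1
range(growth_rate)
```

Note that the natural-log step is close to, but not exactly, the growth rate itself. The approximation is good for small changes and gets worse as the percentage change grows.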
14.1.3 Histogram comparison
We can also compare the histogram of a normal distribution to a distribution that would be normal if we applied logs.
qplot(rnorm(10000, mean = 10, sd = 1), main = "Normal Distribution")
qplot(exp(rnorm(10000, mean = 10, sd = 1)), main = "Log-normal distribution")
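A numerical companion to these two histograms, as a rough sketch: the seed is arbitrary and log_normal is a hypothetical name. Logging the second sample should give back something with a mean near 10 and a standard deviation near 1.

```r
set.seed(14)  # arbitrary seed so the sketch is reproducible

# Same construction as the second histogram above
log_normal <- exp(rnorm(10000, mean = 10, sd = 1))

# The raw values are right-skewed: the mean sits well above the median
mean(log_normal) > median(log_normal)

# After logging, the mean and standard deviation are close to 10 and 1
c(mean = mean(log(log_normal)), sd = sd(log(log_normal)))
```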
14.2 Data
For a practical application of logs we’ll look back at income. This time we’re using zip-code level data from the IRS Statistics of Income (SOI) program.
There are a variety of datasets available on individual income (Form 1040). From the main zip-code data page, click the link to the 2015 data, then download the CSV for all states without AGI. The file with AGI breaks down the observations further into ranges of adjusted gross income (AGI). We will instead focus on the zip-level data across all AGI groups. You will also want to download the documentation to determine which variables we want to keep and how to interpret them.
raw_irs <- read_csv("data/15zpallnoagi.csv")
irs <- raw_irs %>%
  filter(ZIPCODE != '00000') %>% # remove state-level summary
  transmute(
    zip = ZIPCODE,
    state = STATE,
    households = N1,
    population = N2,
    agi = A00100,
    agi_pc = agi / population,
    total_income = A02650,
    wages = A00200,
    farms = SCHF,
    farm_proportion = farms / households,
    taxes = A10300,
    taxes_pc = taxes / population,
    taxes_agi = taxes / agi,
    taxes_total_income = taxes / total_income
  )
as_tibble(irs)
## # A tibble: 27,729 x 14
## zip state households population agi agi_pc total_income wages
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 35004 AL 5110 10390 280757 27.02185 283221 231704
## 2 35005 AL 3260 6450 130927 20.29876 131832 105561
## 3 35006 AL 1230 2640 59415 22.50568 59695 46379
## 4 35007 AL 12170 25760 693284 26.91320 701514 554651
## 5 35010 AL 8160 17020 378765 22.25411 382646 261568
## 6 35014 AL 1610 3150 73119 23.21238 73864 56756
## 7 35016 AL 7010 14310 357800 25.00349 361470 263648
## 8 35019 AL 890 1930 37188 19.26839 37640 29847
## 9 35020 AL 9570 19680 257275 13.07292 259289 217474
## 10 35022 AL 9770 19160 546843 28.54087 552470 434932
## # ... with 27,719 more rows, and 6 more variables: farms <dbl>,
## # farm_proportion <dbl>, taxes <dbl>, taxes_pc <dbl>, taxes_agi <dbl>,
## # taxes_total_income <dbl>
14.3 Logs in position
14.3.1 Population size and AGI per capita
Let’s take a look first at the distribution of population and adjusted gross income per capita across zipcodes:
ggplot(irs, aes(population, agi_pc)) +
  geom_point()
Remember that the log transformation compresses the space between larger values. Our scatter plot indicates that smaller values are currently more compressed. This is a signal that we should log transform our variables. The log transformation can also help with some of the assumptions of linear modelling. In the visual explorations here, the choice to log transform is largely aesthetic. If log transforming a variable makes it easier to visually inspect and understand your data, then it is useful.
14.3.2 Histogram comparison
Let’s look at the histogram for the population:
ggplot(irs, aes(population)) + geom_histogram()
And now the logged version:
ggplot(irs, aes(population)) + geom_histogram() + scale_x_log10()
While the logged version of population feels closer to a normal distribution, it is clearly not normally distributed. Even though population across zip codes is not exactly log-normally distributed, the log scale is still useful for creating visualizations that are easier to interpret and analyze.
ggplot(irs, aes(population, agi_pc)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10()
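There are two ways to get a logged axis: transform the scale, as in the plots above, or transform the variable inside aes(). The sketch below uses a small made-up data frame (toy is hypothetical, not the IRS data) to suggest that both place points at the same positions and differ only in how the axis is labelled.

```r
library(ggplot2)

# Made-up data for illustration; not the IRS data
toy <- data.frame(x = c(1, 10, 100, 1000), y = c(2, 4, 8, 16))

# Option 1: transform the scale; axis labels stay in the original units
p_scale <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_x_log10()

# Option 2: transform the variable; axis labels show the logged values
p_var <- ggplot(toy, aes(log10(x), y)) +
  geom_point()

# layer_data() returns the values each layer actually draws
x_scale <- layer_data(p_scale)$x
x_var <- layer_data(p_var)$x
all.equal(x_scale, x_var)
```

Transforming the scale is usually the more readable choice, because the tick labels remain in units the audience already understands.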
14.3.3 Taxes and Farms
Let’s plot two proportions: the proportion of returns in a zipcode representing farms, farm_proportion, and the proportion of total income that is tax liability, taxes_total_income.
ggplot(irs, aes(farm_proportion, taxes_total_income)) +
  geom_point() +
  geom_smooth()
While the data are densely packed near the origin, that alone does not mean it would be appropriate to log these variables. Keep in mind that the log of 0 is undefined, and the limit of the log as its input approaches 0 from the right is negative infinity. We can manually remove these values with a filter.
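A minimal sketch of such a filter, assuming a small made-up data frame (toy and positive are hypothetical names) in place of the IRS data:

```r
library(dplyr)

# R evaluates the log of zero as -Inf, which has no position on a log axis
log10(0)

# Made-up proportions for illustration; two of them are exactly zero
toy <- data.frame(farm_proportion = c(0, 0.002, 0.05, 0, 0.31))

# Keep only strictly positive values before applying a log scale
positive <- toy %>% filter(farm_proportion > 0)
nrow(positive)

# Every remaining value now has a finite log
all(is.finite(log10(positive$farm_proportion)))
```

Dropping the zeros changes which observations appear in the chart, so it is worth noting how many rows were removed when describing the result.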
ggplot(irs, aes(farm_proportion, taxes_total_income)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_log10(labels = scales::percent)
14.4 Logs in color
Let’s tweak the last visualization by encoding agi_pc in the color of the points.
ggplot(
  irs %>% filter(farm_proportion > 0),
  aes(farm_proportion, taxes_total_income, color = agi_pc)
) +
  geom_point() +
  scale_x_log10(labels = scales::percent) +
  scale_y_log10(labels = scales::percent)
Using the transformation log10 (see trans in the ?continuous_scale documentation), we get a more gradual shift in color that makes it easier to see the changes in per capita adjusted gross income.
ggplot(
  irs %>% filter(farm_proportion > 0),
  aes(farm_proportion, taxes_total_income, color = agi_pc)
) +
  geom_point() +
  scale_x_log10(labels = scales::percent) +
  scale_y_log10(labels = scales::percent) +
  scale_color_continuous(trans = "log10")
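To see why the shift in color becomes more gradual, it helps to look at where values land along the gradient. The sketch below is a simplified picture of that mapping, assuming the gradient position is the value rescaled to the range 0 to 1; vals is a made-up vector, not the agi_pc column.

```r
library(scales)

# Made-up values spanning three orders of magnitude
vals <- c(10, 100, 1000, 10000)

# Position along the gradient (0 = low, 1 = high) with no transformation
linear_pos <- rescale(vals)
linear_pos

# Position along the gradient after a log10 transformation
log_pos <- rescale(log10(vals))
log_pos
```

Without the transformation, the three smallest values all fall in the lowest tenth of the gradient and are nearly indistinguishable. With the log transformation, each order of magnitude takes up an equal share of the gradient.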
We can finally change the low and high colors and add alpha to make our scatter plot easier to read. Note that I have added a log transformation to the size, which is tied to the population variable. Remove the log transformation to see the difference.
library(scales)
ggplot(
  irs %>% filter(farm_proportion > 0),
  aes(farm_proportion, taxes_total_income, color = agi_pc)
) +
  geom_point(aes(size = population), alpha = 0.1) +
  scale_x_log10(labels = scales::percent) +
  scale_y_log10(labels = scales::percent) +
  scale_color_continuous(
    trans = "log10",
    low = scales::muted("red"),
    high = scales::muted("blue")
  ) +
  scale_size_continuous(trans = "log10")
14.5 Assignment
Choose two different variables from the irs dataset to visualize. Choose whether or not to log each variable in your visualization and explain why that was the right choice. Show at least one alternative (logged version vs. raw values) visualization and discuss how it compares to your preferred choice.