# Mixtape-style R code translations

Causal Inference: The Mixtape includes code examples in both Stata and R, which is so cool! The previous PDF-only version of the book only had Stata code, which made it hard to follow along in this class.

As you’ll note throughout this class, there are lots of ways to do things in R. You’ll come across code in the RStudio forums and on StackOverflow that works, but looks slightly different from what you’re used to. In this class, we use the tidyverse, a set of packages like ggplot2 and dplyr that all work together nicely.

The Mixtape also uses the tidyverse (yay!), but has a couple of code quirks and differences. Because the HTML version of the book (with R code) only came out on January 3, I haven’t had time to find all the differences, so I’ll keep a running list here.

Throughout The Mixtape, the code examples load data from external files stored online. The code provided for loading this data is a little more complicated than it needs to be. For instance, consider this from Section 4.0.1:

library(tidyverse)
library(haven)

read_data <- function(df) {
  full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

yule <- read_data("yule.dta") %>%
  lm(paup ~ outrelief + old + pop, .)

summary(yule)
##
## Call:
## lm(formula = paup ~ outrelief + old + pop, data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -17.48  -5.31  -1.83   3.13  25.34
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  63.1877    27.1439    2.33    0.027 *
## outrelief     0.7521     0.1350    5.57  5.8e-06 ***
## old           0.0556     0.2234    0.25    0.805
## pop          -0.3107     0.0669   -4.65  7.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.6 on 28 degrees of freedom
## Multiple R-squared:  0.697,	Adjusted R-squared:  0.665
## F-statistic: 21.5 on 3 and 28 DF,  p-value: 2e-07


Here, The Mixtape authors created a custom function named read_data() that loads a Stata file from a URL at GitHub. They then run a model on that data and store the results as an object named yule. It works, but it’s not the most efficient way to do this for a few reasons:

• You don’t need that whole custom read_data() function. read_dta() from the haven package accepts a full URL, so you can load the data directly.
• It’s best to save the data as its own object, then use that object to fit models. As the code stands, you can’t look at the yule.dta data, make scatterplots, or do anything else with it, since the object stored as yule holds the results from the model, not the data itself.

Instead, this is an easier way to load the data and run the model:

### Translation

# Mixtape version
read_data <- function(df) {
  full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

yule <- read_data("yule.dta") %>%
  lm(paup ~ outrelief + old + pop, .)

summary(yule)

# Our class version
yule_data <- read_dta("https://raw.github.com/scunning1975/mixtape/master/yule.dta")

yule_model <- lm(paup ~ outrelief + old + pop,
                 data = yule_data)

summary(yule_model)
##
## Call:
## lm(formula = paup ~ outrelief + old + pop, data = yule_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -17.48  -5.31  -1.83   3.13  25.34
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  63.1877    27.1439    2.33    0.027 *
## outrelief     0.7521     0.1350    5.57  5.8e-06 ***
## old           0.0556     0.2234    0.25    0.805
## pop          -0.3107     0.0669   -4.65  7.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.6 on 28 degrees of freedom
## Multiple R-squared:  0.697,	Adjusted R-squared:  0.665
## F-statistic: 21.5 on 3 and 28 DF,  p-value: 2e-07


Even better, instead of using summary(yule_model), you can use tidy() and glance() from the broom library, which convert any kind of model object into standard data frames:

library(broom)  # You only need to run this once

tidy(yule_model)
## # A tibble: 4 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)  63.2      27.1        2.33  0.0274
## 2 outrelief     0.752     0.135      5.57  0.00000583
## 3 old           0.0556    0.223      0.249 0.805
## 4 pop          -0.311     0.0669    -4.65  0.0000725
glance(yule_model)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC deviance df.residual  nobs
##       <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1     0.697         0.665  9.55      21.5 0.000000200     3  -115.  241.  248.    2552.          28    32
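
One handy extra (not shown in The Mixtape): tidy() can also include confidence intervals if you pass conf.int = TRUE. Here’s a quick sketch using R’s built-in mtcars data rather than the yule data:

```r
library(broom)

# Illustrative model using R's built-in mtcars data (not a Mixtape example)
car_model <- lm(mpg ~ wt + hp, data = mtcars)

# conf.int = TRUE adds conf.low and conf.high columns to the tidy output
tidy(car_model, conf.int = TRUE, conf.level = 0.95)
```

Because the result is a regular data frame, you can filter it, plot it with ggplot2, or feed it into other tidyverse functions like any other data.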


## Side-by-side regression tables: stargazer vs. modelsummary

Side-by-side regression tables are a standard feature of economics, political science, and policy articles and reports. They conveniently let you see the results from multiple regression models in one table. This blog post goes into detail about how to read them.

There are several R packages for creating these tables automatically. One of the earliest is named stargazer, and tons of people still use it, including The Mixtape. For instance, Section 3.1.5 in The Mixtape shows an example of discrimination and collider bias, and the code there shows three regression models side by side:

collider_discrimination.R:

set.seed(1234)

library(tidyverse)
library(stargazer)

tb <- tibble(
  female = ifelse(runif(10000) >= 0.5, 1, 0),
  ability = rnorm(10000),
  discrimination = female,
  occupation = 1 + 2 * ability + 0 * female - 2 * discrimination + rnorm(10000),
  wage = 1 - 1 * discrimination + 1 * occupation + 2 * ability + rnorm(10000)
)

lm_1 <- lm(wage ~ female, tb)
lm_2 <- lm(wage ~ female + occupation, tb)
lm_3 <- lm(wage ~ female + occupation + ability, tb)

stargazer(lm_1, lm_2, lm_3, type = "html",
          column.labels = c("Biased Unconditional",
                            "Biased",
                            "Unbiased Conditional"))

                                Dependent variable: wage
                     Biased Unconditional      Biased       Unbiased Conditional
                             (1)                (2)                 (3)
female                    -2.900***           0.560***           -0.980***
                          (0.084)             (0.029)            (0.028)
occupation                                    1.800***            0.990***
                                              (0.006)            (0.010)
ability                                                           2.000***
                                                                 (0.023)
Constant                   2.000***           0.240***            1.000***
                          (0.060)             (0.020)            (0.017)
Observations               10,000             10,000             10,000
R2                          0.110              0.910              0.950
Adjusted R2                 0.110              0.910              0.950
Residual Std. Error   4.200 (df = 9998)   1.300 (df = 9997)   1.000 (df = 9996)
F Statistic          1,208.000***         50,089.000***       62,279.000***
                     (df = 1; 9998)       (df = 2; 9997)      (df = 3; 9996)
Note:                          *p<0.1; **p<0.05; ***p<0.01

Neat. That works, and there are a ton of additional options you can feed to stargazer()—run ?stargazer or search in the Help pane for “stargazer” to see more.

However, I’m not a fan of stargazer for a few reasons:

• Stargazer tables won’t knit to Word
• You have to use different syntax when you’re knitting to PDF vs. HTML (i.e. you have to change the type argument)
• Stargazer handles basic models like lm() and glm() well, but doesn’t support newer model types, since (1) it hasn’t been updated for a while, and (2) support for any newer model has to be built in by hand by the developer.

In this class we won’t use stargazer. Instead, we’ll use the newer modelsummary package, which addresses all of these issues:

• modelsummary tables knit to Word
• You don’t have to use any special syntax when knitting documents with modelsummary tables—knit to PDF and they’ll show up in the PDF; knit to Word and they’ll show up in the Word file; knit to HTML and they’ll show up there
• modelsummary handles any model that is supported by the broom package (basically, if you can run tidy() on it, you can stick it in a table with modelsummary()). If a model isn’t supported by broom yet, you can actually add that support yourself. For instance, nonparametric regression discontinuity models made with the rdrobust package don’t work in either stargazer or modelsummary. They’ll probably never be added to stargazer, but if you include the code here in your document, you can add rdrobust models to modelsummary(). Magic!
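
To give a sense of how that works (this is a sketch with a made-up model class, not modelsummary’s actual rdrobust glue code): modelsummary looks for tidy() and glance() methods for a model object’s class, so writing your own S3 methods that return the expected column names is enough to register support for an otherwise-unsupported model:

```r
library(broom)

# Hypothetical model object: the "mymodel" class and its fields are
# invented for this sketch
fake_fit <- structure(
  list(coefs = c("(Intercept)" = 1.2, x = 0.5),
       ses   = c(0.3, 0.1)),
  class = "mymodel"
)

# A tidy() method that returns the column names broom (and therefore
# modelsummary) expects: term, estimate, std.error
tidy.mymodel <- function(x, ...) {
  data.frame(term      = names(x$coefs),
             estimate  = unname(x$coefs),
             std.error = x$ses)
}

# A glance() method for the model-level statistics shown at the
# bottom of the table
glance.mymodel <- function(x, ...) {
  data.frame(nobs = 100)  # made-up value
}

tidy(fake_fit)  # now works, so modelsummary() can build a table from it
```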

The syntax for modelsummary() is a little different from stargazer(). Here’s how to recreate the same table from before:

### Translation

# Stargazer version
stargazer(lm_1, lm_2, lm_3, type = "text",
          column.labels = c("Biased Unconditional",
                            "Biased",
                            "Unbiased Conditional"))

# modelsummary version
library(modelsummary)  # You only need to run this once

modelsummary(list("Biased Unconditional" = lm_1,
                  "Biased" = lm_2,
                  "Unbiased Conditional" = lm_3),
             stars = TRUE)

              Biased Unconditional    Biased    Unbiased Conditional
(Intercept)        2.015***          0.245***        1.005***
                  (0.060)           (0.020)         (0.017)
female            -2.933***          0.561***       -0.984***
                  (0.084)           (0.029)         (0.028)
occupation                           1.790***        0.995***
                                    (0.006)         (0.010)
ability                                              2.007***
                                                    (0.023)
Num.Obs.           10000             10000           10000
R2                  0.108             0.909           0.949
That’s it! You feed modelsummary() a list of models; if you name the elements of the list, the names become the column headings.

The modelsummary() function has a ton of options for changing the appearance, adding extra rows, omitting rows, highlighting rows, and more. See this page for examples.
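
For example (a quick sketch with small models on R’s built-in mtcars data, not the Mixtape example): gof_omit drops goodness-of-fit rows matching a regular expression, and coef_rename gives coefficients nicer labels:

```r
library(modelsummary)

# Small illustrative models on R's built-in mtcars data
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

modelsummary(
  list("Bivariate" = m1, "Multivariate" = m2),
  stars = TRUE,
  coef_rename = c(wt = "Weight", hp = "Horsepower"),  # relabel coefficients
  gof_omit = "AIC|BIC|Log.Lik"  # hide some goodness-of-fit rows
)
```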