Labs
Welcome to our Labs!
OLS and Lasso for wage prediction
An important question in labor economics is what determines the wage of workers. This is a causal question, but we can begin to investigate it from a predictive perspective. In the following wage example, 𝑌 is the (log) hourly wage of a worker and 𝑋 is a vector of the worker's characteristics, e.g., education, experience, gender.
The Gender Wage Gap
In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question of how to use job-relevant characteristics, such as education and experience, to best predict wages.
Exercise on Overfitting
Set of simple simulations that show how measures of fit perform in a high p/n setting.
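As a rough illustration of the kind of simulation this lab describes, the sketch below takes the extreme case p = n, where least squares interpolates the training data. The sample sizes, the seed, the pure-noise outcome, and the helper names (`solve`, `r_squared`, `draw`) are all assumptions made for this example, not taken from the lab itself.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def r_squared(X, y, beta):
    """R^2 of the linear predictions X beta against y (no intercept in the fit)."""
    preds = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pred) ** 2 for yi, pred in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def draw(rng, rows, p):
    """Draw Gaussian regressors and an outcome that is pure noise, unrelated to X."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(rows)]
    y = [rng.gauss(0, 1) for _ in range(rows)]
    return X, y

rng = random.Random(0)   # hypothetical seed
n = p = 20               # hypothetical sizes with p/n = 1
X_train, y_train = draw(rng, n, p)
X_test, y_test = draw(rng, 1000, p)

beta = solve(X_train, y_train)               # interpolates the training data
r2_in = r_squared(X_train, y_train, beta)    # essentially 1 despite no signal
r2_out = r_squared(X_test, y_test, beta)     # negative on fresh data
print(f"in-sample R^2: {r2_in:.3f}, out-of-sample R^2: {r2_out:.3f}")
```

Because the outcome is unrelated to the regressors by construction, the perfect in-sample fit is pure overfitting, and the out-of-sample R² exposes it.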
Vaccination RCT
One of the earliest randomized experiments was the set of Polio vaccination trials conducted by the Public Health Service in 1954. The question was whether the Salk vaccine prevented polio. Children in the study were randomly assigned either a treatment (polio vaccine shot) or a placebo (saline solution shot), without knowing which one they received. The doctors in the study, making the diagnosis, did not know whether a child received a vaccine or not. In other words, the trial was a double-blind, randomized controlled trial.
Covariates in RCT
An economic motivation for this example could be provided as follows: Let D be the treatment of going to college, and let 𝑍 be academic skills. Suppose that academic skills cause lower earnings Y(0) in jobs that don't require a college degree, and cause higher earnings Y(1) in jobs that require college degrees. This type of scenario is reflected in the DGP set-up above.
Reemployment Bonus RCT
In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others.
Penalized Linear Regressions: A Simulation Experiment
Implements different penalized regression methods and examines their performance in approximating regression functions in a simulation experiment. The simulation experiment includes one case with approximate sparsity and another case with both approximately sparse and dense components.
Case Study using Wage Data from 2015
We illustrate how to predict an outcome variable 𝑌 in a high-dimensional setting, where the number of covariates 𝑝 is large in relation to the sample size 𝑛. So far we have used linear prediction rules, e.g., Lasso regression, for estimation. Now, we also consider nonlinear prediction rules including tree-based methods.
Simulation on Orthogonal Estimation
Simulation experiment comparing orthogonal (partialling-out) with non-orthogonal learning (naive method).
Comparing orthogonal (partialling-out) with non-orthogonal learning.
Simulation experiment comparing orthogonal methods with non-orthogonal methods.
Testing the Convergence Hypothesis
Double Lasso analysis of the conditional convergence hypothesis in growth economics.
Heterogeneous Effect of Sex on Wage Using Double Lasso
We use US census data from the year 2012 to analyse the effect of gender and interaction effects of other variables with gender on wage jointly. The dependent variable is the logarithm of the wage, the target variable is female (in combination with other variables). All other variables denote some other socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables we refer to the help page.
Collider Bias
Here the idea is that people who get to Hollywood have to have high congeniality = talent + beauty. Funnily enough, this induces a negative correlation between talent and looks when we condition on the set of actors or celebrities. This simple example explains the anecdotal observation that "talent and beauty are negatively correlated" for celebrities.
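The selection effect described here can be sketched with a small simulation. Talent and beauty are drawn independently, and only people whose sum exceeds a cutoff are kept. The cutoff of 2, the sample size, the seed, and the helper `corr` are assumptions for this illustration, not values from the lab.

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)   # hypothetical seed
n = 100_000              # hypothetical population size
talent = [rng.gauss(0, 1) for _ in range(n)]
beauty = [rng.gauss(0, 1) for _ in range(n)]   # drawn independently of talent

# Collider: congeniality = talent + beauty; "Hollywood" keeps only high values.
cutoff = 2.0             # hypothetical selection threshold
selected = [(t, b) for t, b in zip(talent, beauty) if t + b > cutoff]

corr_all = corr(talent, beauty)
corr_selected = corr([t for t, _ in selected], [b for _, b in selected])
print(f"whole population:  corr = {corr_all:+.3f}")
print(f"selected subgroup: corr = {corr_selected:+.3f}")
```

In the full population the correlation is close to zero by construction. Within the selected group it is clearly negative, because a low value of one trait must be offset by a high value of the other to pass the cutoff.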
Causal Identification in DAGs using Backdoor and SWIGs, Equivalence Classes, Falsifiability Tests
Lists all conditional independences in a DAG; these are obtained by using the graphical d-separation criterion. We then go ahead and test those restrictions assuming a linear ASEM structure. The notebook also illustrates the analysis from the next chapter.
Dosearch for Causal Identification in DAGs.
This is a simple notebook for teaching that illustrates the capabilities of the "dosearch" package, which is a great tool. NB. In my experience, the commands are sensitive to syntax (e.g., spacing when -> is used), so be careful when changing to other examples.
Machine Learning Estimators for Wage Prediction
We illustrate how to predict an outcome variable 𝑌 in a high-dimensional setting, where the number of covariates 𝑝 is large in relation to the sample size 𝑛. So far we have used linear prediction rules, e.g., Lasso regression, for estimation. Now, we also consider nonlinear prediction rules including tree-based methods.
Functional Approximations by Trees and Neural Networks
Illustrate the flexibility of these methods in approximating the function exp(4G).
The Effect of Gun Ownership on Gun-Homicide Rates
Provides an application of DML inference to learn predictive/causal effects of gun ownership on homicide rates across U.S. counties.
Dagitty in the Analysis of Impact of 401(k) on Net Financial Wealth
Analyze graph structures that enable identification of the causal effect of 401(K) eligibility on net financial wealth.
Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models
Provides an application of DML inference to learn predictive/causal effects of 401(K) eligibility on net financial wealth. (Note: The results produced in this notebook and provided in the text differ slightly from those in the original paper. The replication files are given at the following GitHub repository. The difference is due to our use of a single split of the sample in producing the results for this text, while the baseline results are based on a method that aggregates results across multiple data splits.)
Double/Debiased Machine Learning for the Partially Linear Regression Model
This is a simple implementation of Debiased Machine Learning for the Partially Linear Regression Model, which provides an application of DML inference to determine the causal effect of countries' initial wealth on the rate of economic growth.
Variational Autoencoders
In this notebook, we'll introduce and explore "variational autoencoders," which are a very successful family of models in modern deep learning. In particular, we will (1) illustrate the connection between autoencoders and classical Principal Component Analysis (PCA) and (2) train a non-linear variational autoencoder that uses a deep neural network.
DoubleML and Feature Engineering with BERT
Provides an introduction to text embeddings via BERT and provides an application to predicting demand for toys.
Sensitivity Analysis for Unobserved Confounder with DML and Sensmakr
Analyses the sensitivity of the DML estimate in the Darfur wars example to unobserved confounders.
Negative (Proxy) Controls for Unobserved Confounding
Provides an application of using proxy controls to estimate the effect of smoking on birth weight.
Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models
Estimate the Local Average Treatment Effects of 401(K) participation on net financial wealth.
Weak IV Experiments
A simple example of the properties of the IV estimator when instruments are weak.
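To give a feel for what goes wrong, the sketch below repeatedly simulates a simple just-identified IV estimate, cov(z, y) / cov(z, d), under a strong and a weak first stage. The data-generating process, the first-stage coefficients (1.0 and 0.02), the sample size, the number of replications, and all function names are assumptions for this illustration rather than the lab's actual design.

```python
import random
import statistics

def cov(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def iv_estimate(rng, n, first_stage, beta):
    """One simulated IV estimate with a single instrument z.

    Assumed DGP: d = first_stage * z + u and y = beta * d + u,
    so d is endogenous through the shared error u.
    """
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    d = [first_stage * zi + ui for zi, ui in zip(z, u)]
    y = [beta * di + ui for di, ui in zip(d, u)]
    return cov(z, y) / cov(z, d)

def median_abs_error(rng, first_stage, beta=1.0, n=200, reps=200):
    """Median absolute estimation error across replications."""
    errors = [abs(iv_estimate(rng, n, first_stage, beta) - beta) for _ in range(reps)]
    return statistics.median(errors)

rng = random.Random(0)   # hypothetical seed
mae_strong = median_abs_error(rng, first_stage=1.0)
mae_weak = median_abs_error(rng, first_stage=0.02)
print(f"median |error|, strong instrument: {mae_strong:.3f}")
print(f"median |error|, weak instrument:   {mae_weak:.3f}")
```

The median is used instead of the mean because the IV estimator has very heavy tails when the first stage is close to zero, so averages across replications are not informative.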
Debiased ML for Partially Linear IV Model
Provides DML analysis of the impact of institutions on a country’s wealth following AER. The notebook first proceeds with the analysis assuming strong identification. It then notes the weak instrument problem and performs DML analysis that is robust to weak identification.
CATE Inference
Analyzes the CATE in the welfare experiment and in the 401k experiment with Best Linear Predictors of CATE and with Random Forest and Causal Forest based methods.
CATE Inference
Analyzes Best Predictors of CATE for 401(K) conditional on income.
CATE Estimation
Analyzes CATE of welfare experiment and for 401k experiment with forests and other methods.
Regression Discontinuity
This notebook illustrates the use of Regression Discontinuity in an empirical study. We analyze the effect of the antipoverty program Progresa/Oportunidades on the consumption behavior of families in Mexico in the early 2000s.
Minimum Wage Example Notebook with DiD
This notebook implements Difference-in-Differences in an application on the effect of minimum wage changes on teen employment. We use data from Callaway (2022). The data are annual county level data from the United States covering 2001 to 2007. The outcome variable is log county-level teen employment, and the treatment variable is an indicator for whether the county has a minimum wage above the federal minimum wage.
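The basic two-group, two-period version of the Difference-in-Differences comparison can be sketched on simulated data as follows. This is a simplified illustration only: the effect size, the group gap, the common trend, the noise level, and the cell size are hypothetical, and it does not use the Callaway (2022) data or the multi-period design of the notebook.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)   # hypothetical seed
true_effect = -0.05      # hypothetical effect on log teen employment
group_gap = 0.30         # hypothetical level difference between groups
common_trend = 0.10      # hypothetical change shared by both groups
n = 5000                 # hypothetical observations per group-period cell

def outcomes(treated, post):
    """Simulated outcomes for one group-period cell under parallel trends."""
    base = (2.0 + group_gap * treated + common_trend * post
            + true_effect * treated * post)
    return [base + rng.gauss(0, 0.1) for _ in range(n)]

# Cell means indexed by (treated, post).
cell = {(g, t): mean(outcomes(g, t)) for g in (0, 1) for t in (0, 1)}

# DiD: change in the treated group minus change in the control group.
did = (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])

# Naive post-period comparison, which mixes in the level difference.
naive = cell[1, 1] - cell[0, 1]

print(f"true effect: {true_effect:+.3f}")
print(f"DiD:         {did:+.3f}")
print(f"naive:       {naive:+.3f}")
```

Differencing over time removes the fixed level gap between groups, and differencing across groups removes the common trend, so the DiD estimate lands near the assumed effect while the naive comparison does not.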