| 1 | +--- |
| 2 | +title: "Getting Started with DoubleML" |
| 3 | +output: rmarkdown::html_vignette |
| 4 | +vignette: > |
| 5 | + %\VignetteIndexEntry{Getting Started with DoubleML} |
| 6 | + %\VignetteEngine{knitr::rmarkdown} |
| 7 | + %\VignetteEncoding{UTF-8} |
| 8 | +--- |
| 9 | + |
| 10 | +```{r setup, include=FALSE} |
| 11 | +knitr::opts_chunk$set(echo = TRUE) |
| 12 | +knitr::opts_chunk$set(eval = TRUE) |
| 13 | +``` |
| 14 | + |
| 15 | +The following case studies demonstrate the core functionality of `DoubleML`. |
| 16 | + |
| 17 | + |
| 18 | +## Installation |
| 19 | + |
| 20 | +The **DoubleML** package for R can be installed from GitHub with the command below, which requires the [`remotes` package](https://remotes.r-lib.org/index.html) to be installed first. |
| 21 | + |
| 22 | +```{r, eval = FALSE} |
| 23 | +remotes::install_github("DoubleML/doubleml-for-r") |
| 24 | +``` |
| 25 | + |
| 26 | +Once the installation has completed, load the package. |
| 27 | + |
| 28 | +```{r, message=FALSE, warning=FALSE} |
| 29 | +library(DoubleML) |
| 30 | +``` |
| 31 | + |
| 32 | +The Python package `DoubleML` is available via its GitHub repository. For more information, please visit our user guide. |
| 33 | + |
| 34 | +## Data |
| 35 | + |
| 36 | +For our case study, we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment. As a second example, we simulate data from a partially linear regression model. |
| 37 | + |
| 38 | +```{r} |
| 39 | +library(DoubleML) |
| 40 | + |
| 41 | +# Load bonus data |
| 42 | +df_bonus = fetch_bonus(return_type="data.table") |
| 43 | +head(df_bonus) |
| 44 | + |
| 45 | +# Simulate data |
| 46 | +set.seed(3141) |
| 47 | +n_obs = 500 |
| 48 | +n_vars = 100 |
| 49 | +theta = 3 |
| 50 | +X = matrix(rnorm(n_obs*n_vars), nrow=n_obs, ncol=n_vars) |
| 51 | +d = X[,1:3]%*%c(5,5,5) + rnorm(n_obs) |
| 52 | +y = theta*d + X[, 1:3]%*%c(5,5,5) + rnorm(n_obs) |
| 53 | +``` |
| 54 | + |
| 55 | + |
| 56 | +## The causal model |
| 57 | + |
| 58 | +\begin{align*} |
| 59 | +Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0, \\ |
| 60 | +D = m_0(X) + V, & &\mathbb{E}(V | X) = 0, |
| 61 | +\end{align*} |
| 62 | +where $Y$ is the outcome variable and $D$ is the policy variable of interest. |
| 63 | +The high-dimensional vector $X = (X_1, \ldots, X_p)$ consists of other confounding covariates, |
| 64 | +and $\zeta$ and $V$ are stochastic errors. |
| 65 | + |
| 66 | +## The data-backend `DoubleMLData` |
| 67 | + |
| 68 | +`DoubleML` provides interfaces to objects of class [`data.table`](https://rdatatable.gitlab.io/data.table/) as well as the base R classes `data.frame` and `matrix`. Details on the data-backend and the interfaces can be found in the user guide. The `DoubleMLData` class serves as data-backend. It can be initialized from a data frame by specifying the column `y_col="inuidur1"` serving as outcome variable $Y$, the column(s) `d_cols = "tg"` serving as treatment variable $D$, and the columns `x_cols=c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")` specifying the confounders. Alternatively, a matrix interface can be used, as shown below for the simulated data. |
| 69 | + |
| 70 | + |
| 71 | +```{r} |
| 72 | +# Specify the data and variables for the causal model |
| 73 | +dml_data_bonus = DoubleMLData$new(df_bonus, |
| 74 | + y_col = "inuidur1", |
| 75 | + d_cols = "tg", |
| 76 | + x_cols = c("female", "black", "othrace", "dep1", "dep2", |
| 77 | + "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", |
| 78 | + "durable", "lusd", "husd")) |
| 79 | +print(dml_data_bonus) |
| 80 | + |
| 81 | +# matrix interface to DoubleMLData |
| 82 | +dml_data_sim = double_ml_data_from_matrix(X=X, y=y, d=d) |
| 83 | +dml_data_sim |
| 84 | +``` |
| 85 | + |
| 86 | + |
| 87 | +## Learners to estimate the nuisance models |
| 88 | + |
| 89 | +To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate $m_0$ and $g_0$. For the bonus data we use a random forest regression model, and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of `DoubleML` is based on the meta-package [mlr3](https://mlr3.mlr-org.com/) for R. For details on the specification of learners and their hyperparameters, we refer to the user guide section on learners, hyperparameters and hyperparameter tuning. |
| 90 | + |
| 91 | +```{r} |
| 92 | +library(mlr3) |
| 93 | +library(mlr3learners) |
| 94 | +# suppress messages from mlr3 package during fitting |
| 95 | +lgr::get_logger("mlr3")$set_threshold("warn") |
| 96 | + |
| 97 | +learner = lrn("regr.ranger", num.trees=500, mtry=floor(sqrt(n_vars)), max.depth=5, min.node.size=2) |
| 98 | +ml_g_bonus = learner$clone() |
| 99 | +ml_m_bonus = learner$clone() |
| 100 | + |
| 101 | +learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars)/(n_obs))) |
| 102 | +ml_g_sim = learner$clone() |
| 103 | +ml_m_sim = learner$clone() |
| 104 | +``` |
| 105 | + |
| 106 | + |
| 107 | +## Cross-fitting, DML algorithms and Neyman-orthogonal score functions |
| 108 | + |
| 109 | +When initializing a `DoubleMLPLR` object for PLR models, we can further set parameters specifying the resampling: |
| 110 | + |
| 111 | +* The number of folds used for cross-fitting `n_folds` (defaults to `n_folds = 5`) as well as |
| 112 | +* the number of repetitions when applying repeated cross-fitting `n_rep` (defaults to `n_rep = 1`). |
| 113 | + |
| 114 | +Additionally, one can choose between the algorithms `"dml1"` and `"dml2"` via `dml_procedure` (defaults to `"dml2"`). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions. For the PLR model the default score is `"partialling out"`. |
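
For reference, the partialling-out score for the PLR model can be sketched as follows, in the notation of the causal model above. The symbol $\ell(X)$, standing for a prediction of $\mathbb{E}(Y | X)$, and the shorthand $W = (Y, D, X)$, $\eta = (\ell, m)$ are introduced here for illustration and do not appear elsewhere in this vignette.

\begin{align*}
\psi(W; \theta, \eta) &= \big(Y - \ell(X) - \theta (D - m(X))\big) \big(D - m(X)\big), \\
\hat{\theta} &= \frac{\sum_{i=1}^{n} \big(D_i - \hat{m}(X_i)\big) \big(Y_i - \hat{\ell}(X_i)\big)}{\sum_{i=1}^{n} \big(D_i - \hat{m}(X_i)\big)^2}.
\end{align*}

The second line solves $\frac{1}{n} \sum_{i=1}^{n} \psi(W_i; \theta, \hat{\eta}) = 0$ for $\theta$ over all observations pooled, with $\hat{\ell}$ and $\hat{m}$ evaluated out-of-fold. This pooling corresponds to the `"dml2"` idea; `"dml1"` instead solves the equation within each fold and averages the fold-wise solutions.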
| 115 | + |
| 116 | +The user guide provides details on sample-splitting, cross-fitting and repeated cross-fitting, on the double machine learning algorithms, and on the score functions. |
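
To make the cross-fitting idea concrete, the following base R sketch computes a pooled partialling-out estimate by hand. It is only an illustration under stated assumptions, not the package's implementation: ordinary least squares stands in for the machine learners, `n_vars` is reduced to 20 so that least squares is well-behaved, and the helper `predict_ols()` is a hypothetical name introduced here.

```r
# Minimal sketch of cross-fitting for the PLR model (base R only).
# Assumptions (not from the vignette): OLS replaces the machine learners,
# and n_vars is reduced to 20 so that OLS is well-behaved.
set.seed(3141)
n_obs = 500
n_vars = 20
theta = 3
X = matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = as.numeric(X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs))
y = as.numeric(theta * d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs))

# Randomly assign each observation to one of n_folds folds
n_folds = 5
fold_id = sample(rep(seq_len(n_folds), length.out = n_obs))

# Hypothetical helper: fit OLS on the training rows, predict on the test rows
predict_ols = function(X_train, target_train, X_test) {
  beta = lm.fit(cbind(1, X_train), target_train)$coefficients
  as.numeric(cbind(1, X_test) %*% beta)
}

# Out-of-fold predictions of E(Y | X) and E(D | X):
# each observation is predicted by a model that never saw it
l_hat = numeric(n_obs)
m_hat = numeric(n_obs)
for (k in seq_len(n_folds)) {
  test = fold_id == k
  l_hat[test] = predict_ols(X[!test, , drop = FALSE], y[!test], X[test, , drop = FALSE])
  m_hat[test] = predict_ols(X[!test, , drop = FALSE], d[!test], X[test, , drop = FALSE])
}

# Partialling out: regress outcome residuals on treatment residuals,
# pooling all folds (the "dml2"-style estimate)
res_y = y - l_hat
res_d = d - m_hat
theta_hat = sum(res_d * res_y) / sum(res_d^2)
print(theta_hat)
```

With `n_rep` greater than one, the same procedure would be repeated over several random fold assignments and the resulting estimates aggregated.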
| 117 | + |
| 118 | + |
| 119 | +## Estimate double/debiased machine learning models |
| 120 | + |
| 121 | +We now initialize `DoubleMLPLR` objects for our examples using default parameters. The models are estimated by calling the `fit()` method, and we can, for example, inspect the estimated treatment effect using the `summary()` method. A more detailed result summary can be obtained via the `print()` method. Besides the `fit()` method, `DoubleML` model classes also provide functionality to perform statistical inference, such as `bootstrap()`, `confint()` and `p_adjust()`. For details, see the user guide section on variance estimation, confidence intervals and bootstrap standard errors. |
| 122 | + |
| 123 | +```{r} |
| 124 | +set.seed(3141) |
| 125 | +obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus, ml_g=ml_g_bonus, ml_m=ml_m_bonus) |
| 126 | +obj_dml_plr_bonus$fit() |
| 127 | +print(obj_dml_plr_bonus) |
| 128 | + |
| 129 | +obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim, ml_g=ml_g_sim, ml_m=ml_m_sim) |
| 130 | +obj_dml_plr_sim$fit() |
| 131 | +print(obj_dml_plr_sim) |
| 132 | +``` |
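
As a rough illustration of what a confidence interval for the PLR coefficient involves, the base R sketch below builds a normal-approximation interval from the partialling-out score. It is a sketch under stated assumptions rather than the package's code: `res_d` and `res_y` are hypothetical stand-ins for the out-of-fold residuals $D - \hat{m}(X)$ and $Y - \hat{\ell}(X)$, and are simulated directly here to keep the example short.

```r
# Sketch of a normal-approximation confidence interval for the PLR coefficient.
# Assumption: res_d and res_y stand in for out-of-fold residuals of the
# treatment and the outcome; they are simulated directly for brevity.
set.seed(3141)
n_obs = 500
theta = 3
res_d = rnorm(n_obs)
res_y = theta * res_d + rnorm(n_obs)

# Point estimate: solves the empirical score equation for theta
theta_hat = sum(res_d * res_y) / sum(res_d^2)

# Score evaluated at theta_hat and the mean derivative of the score in theta
psi = (res_y - theta_hat * res_d) * res_d
J = -mean(res_d^2)

# Variance estimate, standard error and 95% confidence interval
sigma2_hat = mean(psi^2) / J^2
se = sqrt(sigma2_hat / n_obs)
ci = theta_hat + c(-1, 1) * qnorm(0.975) * se
print(c(estimate = theta_hat, se = se, lower = ci[1], upper = ci[2]))
```

For fitted `DoubleML` objects, the `confint()` method reports such intervals directly, so this computation never has to be done by hand.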
| 133 | + |
| 134 | + |
| 135 | + |
| 136 | + |
| 137 | + |
| 138 | + |