【python因果推断库11】工具变量回归与使用 pymc 验证工具变量4

news2025/4/8 0:45:35

Wald 估计与简单控制回归的比较

CausalPy 和多变量模型

感兴趣的系数

复杂化工具变量公式

Wald 估计与简单控制回归的比较

但现在我们可以将这个估计与仅包含教育作为控制变量的简单回归进行比较。

naive_reg_model, idata_reg = make_reg_model(
    covariate_df.assign(education=df["education"])
)
az.summary(idata_reg, var_names=["beta_z"])[
    ["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]
]

在这里，我们看到包含我们的工具变量和处理变量的回归中，分配给我们的工具变量 `nearcollege_indicator` 的系数权重 beta_z[nearcollege_indicator] 进一步向 0 缩小。这在一定程度上表明排除限制假设仍然是合理的。工具变量的影响被吸收到了处理变量更直接的影响中。

ols_estimate = az.extract(idata_reg["posterior"])["beta_z"].sel(covariates="education")
fig, axs = plt.subplots(2, 1, figsize=(7, 9))
axs = axs.flatten()
ax = axs[0]
ax1 = axs[1]
ax.hist(
    estimate,
    bins=30,
    ec="black",
    alpha=0.5,
    label=r"IV $\beta$ Education",
    rasterized=True,
)
ax1.hist(
    ols_estimate,
    bins=30,
    ec="black",
    alpha=0.5,
    label=r"Simple $\beta$ Education",
    color="red",
    rasterized=True,
)
ax.axvline(
    np.mean(estimate),
    linestyle="--",
    color="k",
    label=f"Expected IV Estimate: {np.round(np.mean(estimate.values), 2)}",
)
ax1.axvline(
    np.mean(ols_estimate),
    linestyle="--",
    color="k",
    label=f"Expected: {np.round(np.mean(ols_estimate.values), 2)}",
)
ax1.set_xlabel(r"$\beta$ coefficient Education")

ax.legend()
ax1.legend(loc="upper left")
ax.set_title(
    "Estimated IV Effect \n  Returns to Schooling",
)
ax1.set_title("Estimated Simple Effect \n  Returns to Schooling");

注意这里简单回归和工具变量估计之间的显著差异。这种对比在许多方面是工具变量设计的核心。通过为我们的问题提出一个工具变量模型，我们争论的是简单回归和工具变量估计之间的差异是由于混淆变量的影响，这种影响扭曲了我们对处理变量对结果的理解。工具变量设计旨在消除这种扭曲效应。了解这些估计之间的差异大小可以让我们感受到所谓的混淆变量所产生的影响。

CausalPy 和多变量模型

现在我们使用 CausalPy 的贝叶斯工具变量回归来拟合模型。在这里，我们可以明确地陈述构成我们模型的结构方程。重要的是，我们确保包含在工具变量公式中的控制变量也被包含在结果公式中。

sample_kwargs = {
    "chains": 4,
    "cores": 4,
    "target_accept": 0.95,
    "progressbar": True,
    "nuts_sampler": "numpyro",  ## requires Jax and Numpyro install
    "idata_kwargs": {"log_likelihood": True},
}
instruments_formula = "education ~ 1 + experience_1 + experience_2 + ethnicity_indicator + south_indicator + smsa_indicator + nearcollege_indicator"
formula = "log_wage ~ 1 + education  + experience_1 + experience_2 + ethnicity_indicator + south_indicator + smsa_indicator"
instruments_data = df[
    [
        "education",
        "nearcollege_indicator",
        "experience_1",
        "experience_2",
        "ethnicity_indicator",
        "smsa_indicator",
        "south_indicator",
    ]
]
data = df[
    [
        "log_wage",
        "education",
        "experience_1",
        "experience_2",
        "ethnicity_indicator",
        "smsa_indicator",
        "south_indicator",
    ]
]
iv = InstrumentalVariable(
    instruments_data=instruments_data,
    data=data,
    instruments_formula=instruments_formula,
    formula=formula,
    model=InstrumentalVariableRegression(sample_kwargs=sample_kwargs),
)

az.summary(iv.idata, var_names=["beta_t", "beta_z"])[
    ["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]
]

感兴趣的系数

如我们所见，beta_z[education] 系数记录了我们的 LATE 估计，并且实质上恢复了与上面的两步 Wald 估计相同的价值。同时请注意，experience_1 变量似乎与其他变量处于不同的数量级。

默认情况下，InstrumentalVariable 类不会从先验预测分布或后验预测分布中采样，就像典型的 CausalPy 模型那样。这主要是因为在工具变量回归中，重点在于 beta_z 和 beta_t 参数，以及在 beta_z[education] 上记录的处理效应的焦点参数。

然而，在模型估计之后完全有可能从后验预测分布中采样。如果您确实希望从后验预测分布中采样，我们强烈建议安装并使用 Jax 采样器进行后验预测采样，因为它通常比基础的 pymc 采样器快得多。

iv.model.sample_predictive_distribution(ppc_sampler="jax")

同样地，我们也可以提取先验预测检查，并观察后验分布如何更新了我们的先验。

with iv.model:
    iv.idata.extend(pm.sample_prior_predictive(var_names=["beta_z"]))
az.plot_dist_comparison(
    iv.idata, var_names=["beta_z"], coords={"covariates": ["education"]}, figsize=(8, 4)
);

上面的图展示了我们对处理效应可能实现的广泛假设，以及在考虑到观测数据的情况下，可能实现的狭窄范围。

复杂化工具变量公式

我们可以通过添加额外的工具变量来进一步评估加强工具变量效应的想法。一个自然的想法是观察当我们添加额外的 `nearcollege2_indicator` 时，教育方程中的工具变量值如何变化。从我们对数据的视觉检查来看，似乎有必要尝试确定接近两年制和四年制大学如何影响教育程度。

instruments_formula = """education ~  experience_1 + experience_2 + ethnicity_indicator + south_indicator + 
                                      smsa_indicator + nearcollege_indicator + nearcollege2_indicator"""

formula = "log_wage ~ 1 + education  + experience_1 + experience_2 + ethnicity_indicator + south_indicator + smsa_indicator"

instruments_data = df[
    [
        "education",
        "nearcollege_indicator",
        "nearcollege2_indicator",
        "experience_1",
        "experience_2",
        "ethnicity_indicator",
        "smsa_indicator",
        "south_indicator",
    ]
]

data = df[
    [
        "log_wage",
        "education",
        "experience_1",
        "experience_2",
        "ethnicity_indicator",
        "smsa_indicator",
        "south_indicator",
    ]
]
iv1 = InstrumentalVariable(
    instruments_data=instruments_data,
    data=data,
    instruments_formula=instruments_formula,
    formula=formula,
    model=InstrumentalVariableRegression(sample_kwargs=sample_kwargs),
)
iv1.model.sample_predictive_distribution(ppc_sampler="jax")

az.summary(iv1.idata, var_names=["beta_t", "beta_z"])[
    ["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]
]

在这里，我们看到额外工具变量 `beta_t[nearcollege2_indicator]` 和原有工具变量 `beta_t[nearcollege_indicator]` 的加入使得 LATE 估计值从 0.13 提升到了 0.16。这在直觉上是合理的，并且或许增强了整体观点，即接近度是一个好的工具变量。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.coloradmin.cn/o/2114569.html

如若内容造成侵权/违法违规/事实不符，请联系多彩编程网进行投诉反馈，一经查实，立即删除！