当前位置：首页>后端>正文

R - tidymodels 之 infer

后端2024-10-22 15:09:39

属于 tidymodels，见 https://github.com/tidymodels/infer

ht-diagram.png

vignette("infer")

我统计学得太差了，难免翻译得有问题。

1. 介绍

infer 实现了一个与 tidyverse 设计框架一致的表达语法进行统计推断。这个包没有为特定的统计检验提供方法，而是将常见假设检验之间共享的原则整合到4个主要动词（函数）中，并辅以许多实用工具来可视化和提取它们的输出。

无论使用哪种假设检验，都在问同样的问题：我们观察到的数据中的效果/差异是真实的，还是偶然的？为了回答这个问题，首先假设观测到的数据来自某个“什么都没有发生”的世界（也就是说，观测到的效果只是由于随机因素），并把这个假设称为零假设。（实际上，我们可能根本不相信零假设——零假设与备择假设是对立的，备择假设认为观测数据中出现的效应实际上是由于“发生了一些事情”。）然后，我们从描述观察到效果的数据中计算一个检验统计量。我们可以使用这个检验统计量来计算p值，给出如果零假设为真，我们观察到的数据可能出现的概率。如果这个概率低于某个预先定义的显著性水平 $R - tidymodels 之 infer,α,第2张$ ，那么我们可以拒绝零假设。

这个包的工作流程是围绕这个思想设计的。从一些数据集开始，

specify(): 指定感兴趣的变量，或变量之间的关系。
hypothesize(): 声明零假设。
generate(): 生成反映零假设的数据。
calculate(): 根据生成的数据计算统计信息的分布，以形成零分布。

在这篇简介中，使用了infer提供的数据集gss，包含 General Social Survey 中11个变量的500个观察样本。

library(infer)
library(tidyverse)
# 加载数据
data(gss)
# 看一下结构
dplyr::glimpse(gss)
#> Rows: 500
#> Columns: 11
#> $ year    <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2~
#> $ age     <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 3~
#> $ sex     <fct> male, female, male, male, male, female, female, f~
#> $ college <fct> degree, no degree, degree, no degree, degree, no ~
#> $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem,~
#> $ hompop  <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3~
#> $ hours   <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 3~
#> $ income  <ord> 000 or more, 000 - 24999, 000 or more, $~
#> $ class   <fct> middle class, working class, working class, worki~
#> $ finrela <fct> below average, below average, below average, abov~
#> $ weight  <dbl> 0.8960034, 1.0825000, 0.5501000, 1.0864000, 1.082~

2. specify(): 指定响应（和解释）变量

可以使用specify()函数指定数据集中你感兴趣的变量。如果你只对受访者的年龄感兴趣，你可以这样写：

gss %>%
  specify(response = age)
#> Response: age (numeric)
#> # A tibble: 500 x 1
#>      age
#>    <dbl>
#>  1    36
#>  2    34
#>  3    24
#>  4    42
#>  5    31
#>  6    32
#>  7    48
#>  8    36
#>  9    30
#> 10    33
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

在前端，specify() 的输出看起来就像是选择了指定数据框中的列。检查这个对象的类，可以看到，"infer"类被添加到数据框类的最前面，这个新类存储了一些额外的元数据。

gss %>%
  specify(response = age) %>%
  class()
#> [1] "infer"      "tbl_df"     "tbl"        "data.frame"

如果对两个变量感兴趣，例如，age和partyid，你可以用以下两种等价的方法之一指定它们的关系：

gss %>%
  specify(age ~ partyid)
#> Response: age (numeric)
#> Explanatory: partyid (factor)
#> # A tibble: 500 x 2
#>      age partyid
#>    <dbl> <fct>  
#>  1    36 ind    
#>  2    34 rep    
#>  3    24 ind    
#>  4    42 ind    
#>  5    31 rep    
#>  6    32 rep    
#>  7    48 dem    
#>  8    36 ind    
#>  9    30 rep    
#> 10    33 dem    
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

# 使用命名参数
gss %>%
  specify(response = age, explanatory = partyid)
#> Response: age (numeric)
#> Explanatory: partyid (factor)
#> # A tibble: 500 x 2
#>      age partyid
#>    <dbl> <fct>  
#>  1    36 ind    
#>  2    34 rep    
#>  3    24 ind    
#>  4    42 ind    
#>  5    31 rep    
#>  6    32 rep    
#>  7    48 dem    
#>  8    36 ind    
#>  9    30 rep    
#> 10    33 dem    
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

如果正在对一个比例或比例的差异进行推断，将需要使用success参数来指定response变量的哪个级别是 success。例如，如果你对拥有大学学位的人口比例感兴趣，你可以使用以下代码：

gss %>%
  specify(response = college, success = "degree")
#> Response: college (factor)
#> # A tibble: 500 x 1
#>    college  
#>    <fct>    
#>  1 degree   
#>  2 no degree
#>  3 degree   
#>  4 no degree
#>  5 degree   
#>  6 no degree
#>  7 no degree
#>  8 degree   
#>  9 degree   
#> 10 no degree
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

3. hypothesis(): 声明零假设

infer流程的下一步通常是使用hypothesis()声明一个零假设。第一步是将"independence"或"point"提供给null参数。如果你的零假设假定两个变量之间是独立的，那么这就是你需要提供给hypothesis()的全部内容：

gss %>%
  specify(college ~ partyid, success = "degree") %>%
  hypothesize(null = "independence")
#> Response: college (factor)
#> Explanatory: partyid (factor)
#> Null Hypothesis: independence
#> # A tibble: 500 x 2
#>    college   partyid
#>    <fct>     <fct>  
#>  1 degree    ind    
#>  2 no degree rep    
#>  3 degree    ind    
#>  4 no degree ind    
#>  5 degree    rep    
#>  6 no degree rep    
#>  7 no degree dem    
#>  8 degree    ind    
#>  9 degree    rep    
#> 10 no degree dem    
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

如果你正在对一个点估计做推断，你还需要提供p（在0到1之间的真实成功比例）、mu（真实平均值）、med（真实中值）或sigma（真实标准差）中的一个。例如，如果零假设是总体中每周平均工作时间为 40 小时，可以这样写：

gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40)
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 500 x 1
#>    hours
#>    <dbl>
#>  1    50
#>  2    31
#>  3    40
#>  4    40
#>  5    40
#>  6    53
#>  7    32
#>  8    20
#>  9    40
#> 10    40
#> # ... with 490 more rows
#> # i Use `print(n = ...)` to see more rows

同样，从前端来看，从hypothesize()输出的数据框与从specify()输出的看起来几乎完全一样，但是infer现在“知道”了零假设。

4. generate(): 生成零分布

一旦使用hypothesis()声明了零假设，我们就可以基于这个假设构造一个零分布。我们可以使用type参数提供的几种方法之一来完成此操作：

bootstrap: 将为每次重复抽取一个 bootstrap 样本，其中，从输入样本数据中抽取（有替换地）一个与输入样本大小一样的样本。
permute: 对于每个重复，每个输入值将被随机地重新分配（没有替换地）到样本中的一个新的输出值。
draw: 对于每个重复，将从一个理论分布中抽取一个值，该分布的参数在hypothesis()中指定。这个选项目前只适用于检验点估计。这种生成类型以前称为"simulate"，现在已被取代。

继续上面的例子，关于每周工作的平均小时数，我们可以这样写：

set.seed(1)

gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 500,000 x 2
#> # Groups:   replicate [1,000]
#>    replicate hours
#>        <int> <dbl>
#>  1         1 46.6 
#>  2         1 43.6 
#>  3         1 38.6 
#>  4         1 28.6 
#>  5         1 38.6 
#>  6         1 38.6 
#>  7         1  6.62
#>  8         1 78.6 
#>  9         1 38.6 
#> 10         1 38.6 
#> # ... with 499,990 more rows
#> # i Use `print(n = ...)` to see more rows

在上面的例子中，我们取 1000 个自举样本来形成零分布。

注意，在generate()之前，我们已经用set.seed()函数设置了随机数生成的种子。当使用infer包进行研究时，或者在其他情况下，当精确的再现性是优先级时，这是一个很好的实践。infer将遵循set.seed()函数中指定的随机种子，当generate()数据给定相同的种子时返回相同的结果。

为了生成两个变量独立性的零分布，我们还可以随机重新洗牌解释变量和响应变量的配对，以打破任何现有的关联。例如，在政党归属不受年龄影响的假设下，生成1000个可用于创建零分布的重复：

gss %>%
  specify(partyid ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
#> Response: partyid (factor)
#> Explanatory: age (numeric)
#> Null Hypothesis: independence
#> # A tibble: 500,000 x 3
#> # Groups:   replicate [1,000]
#>    partyid   age replicate
#>    <fct>   <dbl>     <int>
#>  1 rep        36         1
#>  2 rep        34         1
#>  3 dem        24         1
#>  4 dem        42         1
#>  5 dem        31         1
#>  6 ind        32         1
#>  7 ind        48         1
#>  8 rep        36         1
#>  9 dem        30         1
#> 10 rep        33         1
#> # ... with 499,990 more rows
#> # i Use `print(n = ...)` to see more rows

5. calculate(): 计算汇总统计

calculate()根据infer核心函数的输出计算汇总统计信息。该函数引入一个stat参数，目前是"mean"，"median"，"sum"，"sd"，"prop"，"count"，"diff in means"，"diff in median"，"diff in props"，"Chisq"，"F"，"t"，"z"，"slope"，或"correlation"之一。例如，继续上面的例子来计算每周平均工作时间的零分布：

gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1,000 x 2
#>    replicate  stat
#>        <int> <dbl>
#>  1         1  39.2
#>  2         2  39.1
#>  3         3  39.0
#>  4         4  39.8
#>  5         5  41.4
#>  6         6  39.4
#>  7         7  39.8
#>  8         8  40.4
#>  9         9  41.5
#> 10        10  40.9
#> # ... with 990 more rows
#> # i Use `print(n = ...)` to see more rows

这里calculate()的输出显示了 1000 次重复中每一次的样本统计数据（在本例中是平均值）。如果要对平均值、中位数、比例或 $R - tidymodels 之 infer,t,第3张$ 和 $R - tidymodels 之 infer,z,第4张$ 统计量的差异进行推断，则需要提供一个order参数，给出解释变量应该被减去的顺序。例如，为了找出拥有大学学位和没有大学学位的人的平均年龄的差异，我们可以这样写：

gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate("diff in means", order = c("degree", "no degree"))
#> Response: age (numeric)
#> Explanatory: college (factor)
#> Null Hypothesis: independence
#> # A tibble: 1,000 x 2
#>    replicate   stat
#>        <int>  <dbl>
#>  1         1 -2.35 
#>  2         2 -0.902
#>  3         3  0.403
#>  4         4 -0.426
#>  5         5  0.482
#>  6         6 -0.196
#>  7         7  1.33 
#>  8         8 -1.07 
#>  9         9  1.68 
#> 10        10  0.888
#> # ... with 990 more rows
#> # i Use `print(n = ...)` to see more rows

6. 其它工具

infer还提供了几个程序来提取汇总统计信息和分布的含义，包中提供了一些函数来可视化统计信息相对于分布的位置（使用visualize()）、计算p值（使用get_p_value()）和计算置信区间（使用get_confidence_interval()）。

为了说明，回到确定平均每周工作时间是否为 40 小时的例子。

# find the point estimate
obs_mean <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")

# generate a null distribution
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

点估计 41.382 似乎非常接近 40 点，但又有一点不同。我们可能会想，这种差异是由于随机因素造成的，还是说总体中每周的平均工作时间真的不是 40 小时。

我们可以先可视化零分布。

null_dist %>%
  visualize()

样本的观察统计值在这个分布上位于哪里？可以使用obs_stat参数来指定这一点。

null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_mean, direction = "two-sided")

注意，infer还遮蔽了零分布的区域，这些区域与我们观察到的统计数据一样（或更）极端。（另外，请注意，我们现在使用+操作符来应用shade_p_value函数。这是因为visualize()输出的绘图对象来自ggplot2而不是数据框，并且需要+运算符将p值层添加到绘图对象中。）红条看起来有点偏离零分布的右尾，所以如果均值是 40 小时，观察 41.382 小时的样本均值是不太可能的。

# get a two-tailed p-value
p_value <- null_dist %>%
  get_p_value(obs_stat = obs_mean, direction = "two-sided")

p_value
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1   0.032

p值似乎是 0.032，这非常小，如果每周工作的真正平均小时数是 40，那么样本均值距离 40（1.382 小时）这么远的概率是0.032。这可能在统计上是显著差异，也可能不是，这取决于你在进行分析之前决定的显著性水平 $R - tidymodels 之 infer,α,第2张$ 。如果你让 $R - tidymodels 之 infer,α,第2张$ = .05，那么这个差异在统计上是显著的，但如果你设置 $R - tidymodels 之 infer,α,第2张$ = .01，那么就不显著。

为了得到估计的置信区间，我们可以这样写：

# generate a distribution like the null distribution, 
# though exclude the null hypothesis from the pipeline
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# start with the bootstrap distribution
ci <- boot_dist %>%
  # calculate the confidence interval around the point estimate
  get_confidence_interval(point_estimate = obs_mean,
                          # at the 95% confidence level
                          level = .95,
                          # using the standard error
                          type = "se")

ci
#> # A tibble: 1 x 2
#>   lower_ci upper_ci
#>      <dbl>    <dbl>
#> 1     40.1     42.7

如你所见，每周 40 小时不包含在这个区间内，这与我们之前的结论一致，即这一发现在置信水平 $R - tidymodels 之 infer,α,第2张$ = .05上是显著的。为了直观地看到这个区间，我们可以使用shade_confidence_interval()工具：

boot_dist %>%
  visualize() +
  shade_confidence_interval(endpoints = ci)

7. 理论方法

{infer}还提供了对"Chisq"、"F"、"t"和"z"分布使用理论方法的功能。

通常，要使用基于理论的方法查找零分布，使用与在其他地方查找观察统计量相同的代码，将calculate()替换为assume()。例如，计算观察到的 $R - tidymodels 之 infer,t,第3张$ 统计量（标准化的平均值）:

# calculate an observed t statistic
obs_t <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

然后，定义一个理论 $R - tidymodels 之 infer,t,第3张$ 分布，我们可以这样写：

# switch out calculate with assume to define a distribution
t_dist <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")

从这里开始，理论分布以与基于模拟的零分布相同的方式进行接口。例如，要使用p值：

# visualize the theoretical null distribution
visualize(t_dist) +
  shade_p_value(obs_stat = obs_t, direction = "greater")

# more exactly, calculate the p-value
get_p_value(t_dist, obs_t, "greater")
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1  0.0188

置信区间存在于数据的尺度上，而不是理论分布的标准化尺度上，因此在处理置信区间时，一定要使用非标准化的观察统计数据。

# find the theory-based confidence interval
theor_ci <- 
  get_confidence_interval(
    x = t_dist,
    level = .95,
    point_estimate = obs_mean)

theor_ci
#> # A tibble: 1 x 2
#>   lower_ci upper_ci
#>      <dbl>    <dbl>
#> 1     40.1     42.7

在可视化时，将 $R - tidymodels 之 infer,t,第3张$ 分布进行重定位和缩放，使其与观测数据的尺度一致。

# visualize the theoretical sampling distribution
visualize(t_dist) +
  shade_confidence_interval(theor_ci)

8. 多元回归

为了适应基于随机的推断与多个解释变量，该包实现了一个基于模型拟合的替代工作流。与从重抽样数据中calculate()统计数据不同，该包允许根据零假设重抽样数据fit()线性模型，为每个解释变量提供模型系数。在大多数情况下，在基于calculate()的工作流中，只需将calculate()切换为fit()。

例如，假设想用受访者的age和college毕业状态来拟合每周的工作hours。我们首先可以根据观测数据拟合一个线性模型。

observed_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()

现在，为每一项生成零分布，可以将 1000 个模型放入gss数据集的重抽样中，其中响应hours被排列在每个模型中。注意，这段代码与上面的代码相同，只是增加了hypothesize()和generate()步骤。

null_fits <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()

null_fits
#> # A tibble: 3,000 x 3
#> # Groups:   replicate [1,000]
#>    replicate term          estimate
#>        <int> <chr>            <dbl>
#>  1         1 intercept     40.3    
#>  2         1 age            0.0166 
#>  3         1 collegedegree  1.20   
#>  4         2 intercept     41.3    
#>  5         2 age            0.00664
#>  6         2 collegedegree -0.407  
#>  7         3 intercept     42.9    
#>  8         3 age           -0.0371 
#>  9         3 collegedegree  0.00431
#> 10         4 intercept     42.7    
#> # ... with 2,990 more rows
#> # i Use `print(n = ...)` to see more rows

要置换响应变量以外的变量，generate()的variables参数允许从数据中选择要置换的列。注意，任何依赖于这些列的派生效果（例如，交互效果）也会受到影响。

除此之外，从零拟合中观察到的拟合和分布与calculate()的类似输出完全相同。例如，我们可以使用下面的代码来计算这些对象的 95% 置信区间。

get_confidence_interval(
  null_fits, 
  point_estimate = observed_fit, 
  level = .95)
#> # A tibble: 3 x 3
#>   term          lower_ci upper_ci
#>   <chr>            <dbl>    <dbl>
#> 1 age            -0.0948   0.0987
#> 2 collegedegree  -2.57     2.72  
#> 3 intercept      37.4     45.5

或者，可以从观测数据中对每个观测回归系数的p值进行阴影处理。

visualize(null_fits) + 
  shade_p_value(observed_fit, direction = "both")

9. 总结

就是这样!这个简介涵盖了infer大部分关键功能。更完整信息请参阅help(package = "infer")。

查看全文

https://www.xamrdz.com/backend/3r51996636.html