Many modern problems leverage machine learning methods to predict outcomes based on observable covariates. Subsequent statistical modeling, e.g. to understand population trends in outcomes, often involves treating predicted outcomes as interchangeable with observed data. `postpi`

is an R package we developed to correct downstream statistical inference using outcomes predicted with an arbitrary machine learning method. Our package can be applied to both continuous and categorical data. `postpi`

contains three functions:

`postpi_relate`

- required inputs: a data set (i.e. testing set) containing only observed and predicted outcomes, and column name for observed outcomes.
- optional inputs: a method from the caret package that user defines to relate categorical observed outcome and probablities of predicted categories. The default method set for the function is k-nearest neighbors.
- purpose: the function models the relationship between observed outcomes and predicted outcomes/probabilities, and returns the relationship model.

`postpi`

- required inputs: a data set (i.e. validation set) containing predicted outcomes and covariates, the relationship model estimated from
`postpi_relate()`

, and an inference formula. - optional inputs: the number of bootstrap times, and a seed number. The default number of bootstrap times is 100 and the default seed number is 1234.
- purposes: the function provides the corrected inference result table using a bootstrap approach for continuous/catigorical outcomes. The format of the output is a tidy table with 5 colomns: term, estimate, std.error, statistic, p.value.

- required inputs: a data set (i.e. validation set) containing predicted outcomes and covariates, the relationship model estimated from
`postpi_der`

- required inputs: a testing set containing observed and predicted continuous outcomes, column names for observed and predicted outcomes, a validation set containing predicted outcomes and covariates, and an inference formula.
- optional inputs: None.
- purposes: the function provides the corrected inference result table using a derivation approach only for continuous outcomes. The format of the output is a tidy table with 5 colomns: term, estimate, std.error, statistic, p.value.

Prepare a data set with observed outcomes and predicted outcomes/probabilities for each predicted categories, and covariates of interest for subsequant inferential analyses.

Split the data set into testing and validation sets. On testing set, use

`postpi_relate()`

to estimate a relationship model.On validation set, use

`postpi()`

/`postpi_der()`

to conduct inferential analyses.

Note: If users have a subset of observed outcomes but no predicted outcomes, they should split the data set into three sets – training, testing, and validation sets. On training set, use a machine learning method to train a prediction model, and apply it to testing and validation sets to get predicted results. Then users should repeat step 2-3 above to obtain inference results.

```
library(dplyr)
library(kableExtra)
```

In this example, we use a data set `RINdata`

available in the package. `RINdata`

contains a column of observed RIN values named `actual`

, a column of predicted RIN values named `prediction`

obtained from a previous trained data set, and 200 columns of gene expression regions. We want to study associations between RINs and gene expression levels. A detailed description of the `RINdata`

data set is available at our paper Post-prediction inference.

- We load
`RINdata`

and split data into testing and validation sets.

```
data("RINdata")
data <- RINdata
## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(data, prop = 1/2)
testing <- rsample::training(data_split)
validation <- rsample::testing(data_split)
```

- We select the columns of the observed and predicted outcomes from
`RINdata`

, and pass it to`postpi_relate()`

to estimate a relationship model named`rel_model`

.

```
## fit the relationship model on testing set
rel_model <- testing %>%
select(actual, predictions) %>%
postpi_relate(actual)
```

- We define an inference formula
`predictions ~ region_10`

such that we want to relate gene expression levels in region 10 to predicted RINs. Then we pass in the validation set, the defined inference formula, and the relationship model`rel_model`

estimated above to the inference function`postpi()`

. In`postpi()`

we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named`results_postpi`

.

```
inf_formula <- predictions ~ region_10
## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)
```

- We repeat step 3, but now we pass necessary inputs to the inference function
`postpi_der`

. In`postpi()`

we estiamte inference results using a derivation approach and we obtain the results in a tidy table format named`results_der`

.

```
## fit the inference model on validation set and make iap corrections using derivation approach
results_der <- testing %>%
postpi_der(actual, predictions, validation, inf_formula)
```

- Now we have the inference results on validation set:
`results_postpi`

from a bootstrap approach`postpi()`

, and`results_der`

from a derivation approach. We compare them to the`no correction`

approach (i.e. inference results using predicted outcomes), and the`gold standard`

(i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. To better compare results, we highlight the result of standard errors in red color and t-statistics in blue color.

We first print out the results from the `gold standard`

. We observe that the standard error for the estimate is 0.20 (red) and the t-statistic is -4.05 (blue).

```
## show the inference results on validation set using observed outcomes
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.0818374 | 0.0202238 | -4.046594 | 5.36e-05 |

From our bootstrap approach through function `postpi()`

we get the results with standard error 0.020 (red) and t-statistic -4.93 (blue).

```
kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.1004698 | 0.0203786 | -4.930166 | 9e-07 |

From our derivation approach through function `postpi_der()`

we get the results with standard error 0.023 (red) and t-statistic -4.74 (blue).

```
kable(results_der) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.1102849 | 0.0232892 | -4.735451 | 2.3e-06 |

We observe that the standard errors we obtained using our method are very similar to the gold standard, and the t-statistics from our method is also close to the gold standard. However, if we decide to use predicted RINs as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.017 (red), smaller than the gold standard 0.020; the t-statistic is -6.87, with absolute value larger than the gold standard -4.05.

```
## show the inference results on validation set without corrections
broom::tidy(lm(inf_formula, validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.1159294 | 0.0168645 | -6.87418 | 0 |

By comparing the inference results using our package and the no correction result, we conclude that our method works well to correct the inference towards the gold standard!

In the above example, we define the inference model for one covariate of interest as `predictions ~ region_10`

. Our method can also be applied to correct inferences for a covariate adjusting for other covariates. For example, we define an inference model as `predictions ~ region_10 + region_20 + region_50`

. We repeat step 2-5 in above analyses:

```
## fit the relationship model on testing set
rel_model <- testing %>%
select(actual, predictions) %>%
postpi_relate(actual)
inf_formula <- predictions ~ region_10 + region_20 + region_50
## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)
results_der <- testing %>%
postpi_der(actual, predictions, validation, inf_formula)
```

We print out the results from the `gold standard`

, our method `results_postpi`

and `results_der`

, and the `no correction`

.

```
## gold standard
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.0197434 | 0.0240504 | -0.8209152 | 0.4117768 |

region_20 | 0.0423561 | 0.0078126 | 5.4214824 | 0.0000001 |

region_50 | -0.1831706 | 0.0188360 | -9.7245065 | 0.0000000 |

```
## postpi
kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.0659602 | 0.0244276 | -2.700234 | 0.0069781 |

region_20 | 0.0440862 | 0.0079070 | 5.575565 | 0.0000000 |

region_50 | -0.1449285 | 0.0191163 | -7.581420 | 0.0000000 |

```
## postpi_der
kable(results_der) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.0575228 | 0.0280625 | -2.049810 | 0.0404923 |

region_20 | 0.0403675 | 0.0091159 | 4.428234 | 0.0000099 |

region_50 | -0.1637844 | 0.0219782 | -7.452136 | 0.0000000 |

```
## no correction
broom::tidy(lm(inf_formula, validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_10 | -0.0625164 | 0.0197435 | -3.166438 | 0.0015627 |

region_20 | 0.0523681 | 0.0064135 | 8.165238 | 0.0000000 |

region_50 | -0.1882510 | 0.0154628 | -12.174439 | 0.0000000 |

Again, we observe that we obtain more accurate standard errors (red), and t-statistics (blue) using our method `results_postpi`

and `results_der`

compared to the `no correction`

approach.

In this example, we use a data set `TISSUEdata`

available in the package. `TISSUEdata`

contains a column of observed tissue types (breast / adipose tissues) named `actual`

, a column of predicted tissue types named `predictions`

, two columns of the probabilities of each predicted tissue types named `Breast`

and `Adipose Tissue`

obtained from a previous trained data set, and 2281 columns of gene expression regions. We want to study associations between breast / adipose tissue types and gene expression levels. A detailed description of the `TISSUEdata`

data set is available at our paper Post-prediction inference.

- We read in data set
`TISSUEdata`

, clean it and make the class of observed and predicted tissue type columns (`actual`

and`predictions`

) to be factor. Then we split the clean data set into testing and validation sets.

```
data("TISSUEdata")
TISSUE_data <- TISSUEdata
colnames(TISSUE_data)[colnames(TISSUE_data) == "Adipose Tissue"] <- "Adipose_Tissue"
TISSUE_data$predictions <- as.character(TISSUE_data$predictions)
TISSUE_data$actual <- as.character(TISSUE_data$actual)
TISSUE_data[TISSUE_data == "Adipose Tissue"] <- "Adipose_Tissue"
TISSUE_data$actual <- as.factor(TISSUE_data$actual)
TISSUE_data$predictions <- as.factor(TISSUE_data$predictions)
## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(TISSUE_data, prop = 1/2)
testing <- rsample::training(data_split)
validation <- rsample::testing(data_split)
```

- We select the three columns from the data set: one column with the observed tissue types, two columns with the probabilities of the predicted tissue types. We then pass the data subset to
`postpi_relate()`

to estimate a relationship model between observed outcomes and predicted probabilities named`rel_model`

.

```
# fit the relationship model on testing set
rel_model <- testing %>%
select(actual, Adipose_Tissue, Breast) %>%
postpi_relate(actual)
```

- We define an inference formula
`predictions ~ region_200`

such that we want to relate gene expression levels in region 200 to predicted tissue types. Then we pass in the validation set, the defined inference formula, and the relationship model`rel_model`

estimated above to the inference function`postpi()`

. In`postpi()`

we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named`results_postpi`

. In this example the data set we use contains categorical data, so we use bootstrap approach only.

```
inf_formula <- predictions ~ region_200
## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)
```

- Now we have the inference result
`results_postpi`

on validation set from a bootstrap approach. We compare it to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. Again we highlight standard errors in red color and t-statistics in blue color.

We first print out the results from the `gold standard`

. We observe that the standard error for the estimate is 0.25 (red) and the t-statistic is 3.80 (blue).

```
## show the inference results on validation set using observed outcomes
broom::tidy(glm(update(inf_formula, actual ~ .), validation, family = binomial(link = "logit")))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_200 | 0.9685137 | 0.2546283 | 3.803638 | 0.0001426 |

From our bootstrap approach through function `postpi()`

we get the results with standard error 0.23 (red) and t-statistic 3.60 (blue).

```
kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_200 | 0.8457281 | 0.2346512 | 3.604192 | 0.0004324 |

We observe that the standard error and t-statistic we obtained using our method are close to the gold standard. However, if we decide to use predicted tissue types as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.20 (red), smaller than the gold standard 0.25; the t-statistic is 5.11, with absolute value larger than the gold standard 3.80.

```
## show the inference results on validation set without corrections
broom::tidy(glm(inf_formula, validation, family = binomial(link = "logit")))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

region_200 | 1.060776 | 0.207377 | 5.115207 | 3e-07 |

By comparing the inference results using our package and the no correction result, again we conclude that our method works well to correct the inference towards the gold standard!