# vignettes for postpi

### Introduction to the package

Many modern problems leverage machine learning methods to predict outcomes based on observable covariates. Subsequent statistical modeling, e.g. to understand population trends in outcomes, often involves treating predicted outcomes as interchangeable with observed data. postpi is an R package we developed to correct downstream statistical inference using outcomes predicted with an arbitrary machine learning method. Our package can be applied to both continuous and categorical data. postpi contains three functions:

• postpi_relate
• required inputs: a data set (i.e. testing set) containing only observed and predicted outcomes, and column name for observed outcomes.
• optional inputs: a method from the caret package that user defines to relate categorical observed outcome and probablities of predicted categories. The default method set for the function is k-nearest neighbors.
• purpose: the function models the relationship between observed outcomes and predicted outcomes/probabilities, and returns the relationship model.
• postpi
• required inputs: a data set (i.e. validation set) containing predicted outcomes and covariates, the relationship model estimated from postpi_relate(), and an inference formula.
• optional inputs: the number of bootstrap times, and a seed number. The default number of bootstrap times is 100 and the default seed number is 1234.
• purposes: the function provides the corrected inference result table using a bootstrap approach for continuous/catigorical outcomes. The format of the output is a tidy table with 5 colomns: term, estimate, std.error, statistic, p.value.
• postpi_der
• required inputs: a testing set containing observed and predicted continuous outcomes, column names for observed and predicted outcomes, a validation set containing predicted outcomes and covariates, and an inference formula.
• optional inputs: None.
• purposes: the function provides the corrected inference result table using a derivation approach only for continuous outcomes. The format of the output is a tidy table with 5 colomns: term, estimate, std.error, statistic, p.value.

### Procedure to use the package

1. Prepare a data set with observed outcomes and predicted outcomes/probabilities for each predicted categories, and covariates of interest for subsequant inferential analyses.

2. Split the data set into testing and validation sets. On testing set, use postpi_relate() to estimate a relationship model.

3. On validation set, use postpi()/postpi_der() to conduct inferential analyses.

Note: If users have a subset of observed outcomes but no predicted outcomes, they should split the data set into three sets – training, testing, and validation sets. On training set, use a machine learning method to train a prediction model, and apply it to testing and validation sets to get predicted results. Then users should repeat step 2-3 above to obtain inference results.

### Example to use the package on a data set with continuous outcomes.

library(dplyr)
library(kableExtra)

In this example, we use a data set RINdata available in the package. RINdata contains a column of observed RIN values named actual, a column of predicted RIN values named prediction obtained from a previous trained data set, and 200 columns of gene expression regions. We want to study associations between RINs and gene expression levels. A detailed description of the RINdata data set is available at our paper Post-prediction inference.

1. We load RINdata and split data into testing and validation sets.
data("RINdata")
data <- RINdata

## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(data, prop = 1/2)
testing    <- rsample::training(data_split)
validation <- rsample::testing(data_split)
1. We select the columns of the observed and predicted outcomes from RINdata, and pass it to postpi_relate() to estimate a relationship model named rel_model.
## fit the relationship model on testing set
rel_model <- testing %>%
select(actual, predictions) %>%
postpi_relate(actual)
1. We define an inference formula predictions ~ region_10 such that we want to relate gene expression levels in region 10 to predicted RINs. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model estimated above to the inference function postpi(). In postpi() we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi.
inf_formula <- predictions ~ region_10

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)
1. We repeat step 3, but now we pass necessary inputs to the inference function postpi_der. In postpi() we estiamte inference results using a derivation approach and we obtain the results in a tidy table format named results_der.
## fit the inference model on validation set and make iap corrections using derivation approach
results_der <- testing %>%
postpi_der(actual, predictions, validation, inf_formula)
1. Now we have the inference results on validation set: results_postpi from a bootstrap approach postpi(), and results_der from a derivation approach. We compare them to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. To better compare results, we highlight the result of standard errors in red color and t-statistics in blue color.

We first print out the results from the gold standard. We observe that the standard error for the estimate is 0.20 (red) and the t-statistic is -4.05 (blue).

## show the inference results on validation set using observed outcomes
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.0818374 0.0202238 -4.046594 5.36e-05

From our bootstrap approach through function postpi() we get the results with standard error 0.020 (red) and t-statistic -4.93 (blue).

kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.1004698 0.0203786 -4.930166 9e-07

From our derivation approach through function postpi_der() we get the results with standard error 0.023 (red) and t-statistic -4.74 (blue).

kable(results_der) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.1102849 0.0232892 -4.735451 2.3e-06

We observe that the standard errors we obtained using our method are very similar to the gold standard, and the t-statistics from our method is also close to the gold standard. However, if we decide to use predicted RINs as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.017 (red), smaller than the gold standard 0.020; the t-statistic is -6.87, with absolute value larger than the gold standard -4.05.

## show the inference results on validation set without corrections
broom::tidy(lm(inf_formula, validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.1159294 0.0168645 -6.87418 0

By comparing the inference results using our package and the no correction result, we conclude that our method works well to correct the inference towards the gold standard!

In the above example, we define the inference model for one covariate of interest as predictions ~ region_10. Our method can also be applied to correct inferences for a covariate adjusting for other covariates. For example, we define an inference model as predictions ~ region_10 + region_20 + region_50. We repeat step 2-5 in above analyses:

## fit the relationship model on testing set
rel_model <- testing %>%
select(actual, predictions) %>%
postpi_relate(actual)

inf_formula <- predictions ~ region_10 + region_20 + region_50

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)

results_der <- testing %>%
postpi_der(actual, predictions, validation, inf_formula)

We print out the results from the gold standard, our method results_postpi and results_der, and the no correction.

## gold standard
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.0197434 0.0240504 -0.8209152 0.4117768
region_20 0.0423561 0.0078126 5.4214824 0.0000001
region_50 -0.1831706 0.0188360 -9.7245065 0.0000000
## postpi
kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.0659602 0.0244276 -2.700234 0.0069781
region_20 0.0440862 0.0079070 5.575565 0.0000000
region_50 -0.1449285 0.0191163 -7.581420 0.0000000
## postpi_der
kable(results_der) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.0575228 0.0280625 -2.049810 0.0404923
region_20 0.0403675 0.0091159 4.428234 0.0000099
region_50 -0.1637844 0.0219782 -7.452136 0.0000000
## no correction
broom::tidy(lm(inf_formula, validation))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_10 -0.0625164 0.0197435 -3.166438 0.0015627
region_20 0.0523681 0.0064135 8.165238 0.0000000
region_50 -0.1882510 0.0154628 -12.174439 0.0000000

Again, we observe that we obtain more accurate standard errors (red), and t-statistics (blue) using our method results_postpi and results_der compared to the no correction approach.

### Example to use the package on a data set with categorical outcomes.

In this example, we use a data set TISSUEdata available in the package. TISSUEdata contains a column of observed tissue types (breast / adipose tissues) named actual, a column of predicted tissue types named predictions, two columns of the probabilities of each predicted tissue types named Breast and Adipose Tissue obtained from a previous trained data set, and 2281 columns of gene expression regions. We want to study associations between breast / adipose tissue types and gene expression levels. A detailed description of the TISSUEdata data set is available at our paper Post-prediction inference.

1. We read in data set TISSUEdata, clean it and make the class of observed and predicted tissue type columns (actual and predictions) to be factor. Then we split the clean data set into testing and validation sets.
data("TISSUEdata")
TISSUE_data <- TISSUEdata

TISSUE_data$predictions <- as.character(TISSUE_data$predictions)
TISSUE_data$actual <- as.character(TISSUE_data$actual)

TISSUE_data$actual <- as.factor(TISSUE_data$actual)
TISSUE_data$predictions <- as.factor(TISSUE_data$predictions)

## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(TISSUE_data, prop = 1/2)
testing    <- rsample::training(data_split)
validation <- rsample::testing(data_split)
1. We select the three columns from the data set: one column with the observed tissue types, two columns with the probabilities of the predicted tissue types. We then pass the data subset to postpi_relate() to estimate a relationship model between observed outcomes and predicted probabilities named rel_model.
# fit the relationship model on testing set
rel_model <- testing %>%
postpi_relate(actual)
1. We define an inference formula predictions ~ region_200 such that we want to relate gene expression levels in region 200 to predicted tissue types. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model estimated above to the inference function postpi(). In postpi() we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi. In this example the data set we use contains categorical data, so we use bootstrap approach only.
inf_formula <- predictions ~ region_200

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
postpi(rel_model, inf_formula)
1. Now we have the inference result results_postpi on validation set from a bootstrap approach. We compare it to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. Again we highlight standard errors in red color and t-statistics in blue color.

We first print out the results from the gold standard. We observe that the standard error for the estimate is 0.25 (red) and the t-statistic is 3.80 (blue).

## show the inference results on validation set using observed outcomes
broom::tidy(glm(update(inf_formula, actual ~ .), validation, family = binomial(link = "logit")))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_200 0.9685137 0.2546283 3.803638 0.0001426

From our bootstrap approach through function postpi() we get the results with standard error 0.23 (red) and t-statistic 3.60 (blue).

kable(results_postpi) %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_200 0.8457281 0.2346512 3.604192 0.0004324

We observe that the standard error and t-statistic we obtained using our method are close to the gold standard. However, if we decide to use predicted tissue types as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.20 (red), smaller than the gold standard 0.25; the t-statistic is 5.11, with absolute value larger than the gold standard 3.80.

## show the inference results on validation set without corrections
broom::tidy(glm(inf_formula, validation, family = binomial(link = "logit")))[-1,] %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
column_spec(3, bold = T, color = "red") %>%
column_spec(4, bold = T, color = "blue")
term estimate std.error statistic p.value
region_200 1.060776 0.207377 5.115207 3e-07

By comparing the inference results using our package and the no correction result, again we conclude that our method works well to correct the inference towards the gold standard!