vignettes for postpi

Example to use the package on a data set with continuous outcomes.

library(dplyr)
library(kableExtra)

In this example, we use a data set RINdata available in the package. RINdata contains a column of observed RIN values named actual, a column of predicted RIN values named prediction obtained from a previous trained data set, and 200 columns of gene expression regions. We want to study associations between RINs and gene expression levels. A detailed description of the RINdata data set is available at our paper Post-prediction inference.

We load RINdata and split data into testing and validation sets.

data("RINdata")
data <- RINdata

## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(data, prop = 1/2)
testing    <- rsample::training(data_split)
validation <- rsample::testing(data_split)

We select the columns of the observed and predicted outcomes from RINdata, and pass it to postpi_relate() to estimate a relationship model named rel_model.

## fit the relationship model on testing set
rel_model <- testing %>%
  select(actual, predictions) %>%
  postpi_relate(actual)

We define an inference formula predictions ~ region_10 such that we want to relate gene expression levels in region 10 to predicted RINs. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model estimated above to the inference function postpi(). In postpi() we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi.

inf_formula <- predictions ~ region_10

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
  postpi(rel_model, inf_formula)

We repeat step 3, but now we pass necessary inputs to the inference function postpi_der. In postpi() we estiamte inference results using a derivation approach and we obtain the results in a tidy table format named results_der.

## fit the inference model on validation set and make iap corrections using derivation approach
results_der <- testing %>%
  postpi_der(actual, predictions, validation, inf_formula)

Now we have the inference results on validation set: results_postpi from a bootstrap approach postpi(), and results_der from a derivation approach. We compare them to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. To better compare results, we highlight the result of standard errors in red color and t-statistics in blue color.

We first print out the results from the gold standard. We observe that the standard error for the estimate is 0.20 (red) and the t-statistic is -4.05 (blue).

## show the inference results on validation set using observed outcomes
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.0818374	0.0202238	-4.046594	5.36e-05

From our bootstrap approach through function postpi() we get the results with standard error 0.020 (red) and t-statistic -4.93 (blue).

kable(results_postpi) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.1004698	0.0203786	-4.930166	9e-07

From our derivation approach through function postpi_der() we get the results with standard error 0.023 (red) and t-statistic -4.74 (blue).

kable(results_der) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.1102849	0.0232892	-4.735451	2.3e-06

We observe that the standard errors we obtained using our method are very similar to the gold standard, and the t-statistics from our method is also close to the gold standard. However, if we decide to use predicted RINs as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.017 (red), smaller than the gold standard 0.020; the t-statistic is -6.87, with absolute value larger than the gold standard -4.05.

## show the inference results on validation set without corrections
broom::tidy(lm(inf_formula, validation))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.1159294	0.0168645	-6.87418	0

By comparing the inference results using our package and the no correction result, we conclude that our method works well to correct the inference towards the gold standard!

In the above example, we define the inference model for one covariate of interest as predictions ~ region_10. Our method can also be applied to correct inferences for a covariate adjusting for other covariates. For example, we define an inference model as predictions ~ region_10 + region_20 + region_50. We repeat step 2-5 in above analyses:

## fit the relationship model on testing set
rel_model <- testing %>%
  select(actual, predictions) %>%
  postpi_relate(actual)

inf_formula <- predictions ~ region_10 + region_20 + region_50 

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
  postpi(rel_model, inf_formula)

results_der <- testing %>%
  postpi_der(actual, predictions, validation, inf_formula)

We print out the results from the gold standard, our method results_postpi and results_der, and the no correction.

## gold standard
broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.0197434	0.0240504	-0.8209152	0.4117768
region_20	0.0423561	0.0078126	5.4214824	0.0000001
region_50	-0.1831706	0.0188360	-9.7245065	0.0000000

## postpi
kable(results_postpi) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.0659602	0.0244276	-2.700234	0.0069781
region_20	0.0440862	0.0079070	5.575565	0.0000000
region_50	-0.1449285	0.0191163	-7.581420	0.0000000

## postpi_der
kable(results_der) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.0575228	0.0280625	-2.049810	0.0404923
region_20	0.0403675	0.0091159	4.428234	0.0000099
region_50	-0.1637844	0.0219782	-7.452136	0.0000000

## no correction
broom::tidy(lm(inf_formula, validation))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_10	-0.0625164	0.0197435	-3.166438	0.0015627
region_20	0.0523681	0.0064135	8.165238	0.0000000
region_50	-0.1882510	0.0154628	-12.174439	0.0000000

Again, we observe that we obtain more accurate standard errors (red), and t-statistics (blue) using our method results_postpi and results_der compared to the no correction approach.

Example to use the package on a data set with categorical outcomes.

In this example, we use a data set TISSUEdata available in the package. TISSUEdata contains a column of observed tissue types (breast / adipose tissues) named actual, a column of predicted tissue types named predictions, two columns of the probabilities of each predicted tissue types named Breast and Adipose Tissue obtained from a previous trained data set, and 2281 columns of gene expression regions. We want to study associations between breast / adipose tissue types and gene expression levels. A detailed description of the TISSUEdata data set is available at our paper Post-prediction inference.

We read in data set TISSUEdata, clean it and make the class of observed and predicted tissue type columns (actual and predictions) to be factor. Then we split the clean data set into testing and validation sets.

data("TISSUEdata")
TISSUE_data <- TISSUEdata


colnames(TISSUE_data)[colnames(TISSUE_data) == "Adipose Tissue"] <- "Adipose_Tissue"

TISSUE_data$predictions <- as.character(TISSUE_data$predictions)
TISSUE_data$actual      <- as.character(TISSUE_data$actual)

TISSUE_data[TISSUE_data == "Adipose Tissue"] <- "Adipose_Tissue"

TISSUE_data$actual      <- as.factor(TISSUE_data$actual)
TISSUE_data$predictions <- as.factor(TISSUE_data$predictions)

## split the data into testing and validation sets using rsample package
set.seed(2019)
data_split <- rsample::initial_split(TISSUE_data, prop = 1/2)
testing    <- rsample::training(data_split)
validation <- rsample::testing(data_split)

We select the three columns from the data set: one column with the observed tissue types, two columns with the probabilities of the predicted tissue types. We then pass the data subset to postpi_relate() to estimate a relationship model between observed outcomes and predicted probabilities named rel_model.

# fit the relationship model on testing set
rel_model <- testing %>%
  select(actual, Adipose_Tissue, Breast) %>%
  postpi_relate(actual)

We define an inference formula predictions ~ region_200 such that we want to relate gene expression levels in region 200 to predicted tissue types. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model estimated above to the inference function postpi(). In postpi() we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi. In this example the data set we use contains categorical data, so we use bootstrap approach only.

inf_formula <- predictions ~ region_200

## fit the inference model on validation set and make iap corrections using bootstrap approach
results_postpi <- validation %>%
  postpi(rel_model, inf_formula)

Now we have the inference result results_postpi on validation set from a bootstrap approach. We compare it to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes).

Note in practice we don’t have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. Again we highlight standard errors in red color and t-statistics in blue color.

We first print out the results from the gold standard. We observe that the standard error for the estimate is 0.25 (red) and the t-statistic is 3.80 (blue).

## show the inference results on validation set using observed outcomes
broom::tidy(glm(update(inf_formula, actual ~ .), validation, family = binomial(link = "logit")))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_200	0.9685137	0.2546283	3.803638	0.0001426

From our bootstrap approach through function postpi() we get the results with standard error 0.23 (red) and t-statistic 3.60 (blue).

kable(results_postpi) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_200	0.8457281	0.2346512	3.604192	0.0004324

We observe that the standard error and t-statistic we obtained using our method are close to the gold standard. However, if we decide to use predicted tissue types as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.20 (red), smaller than the gold standard 0.25; the t-statistic is 5.11, with absolute value larger than the gold standard 3.80.

## show the inference results on validation set without corrections
broom::tidy(glm(inf_formula, validation, family = binomial(link = "logit")))[-1,] %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(3, bold = T, color = "red") %>%
  column_spec(4, bold = T, color = "blue")

term	estimate	std.error	statistic	p.value
region_200	1.060776	0.207377	5.115207	3e-07

By comparing the inference results using our package and the no correction result, again we conclude that our method works well to correct the inference towards the gold standard!

vignettes for postpi

Siruo Wang

September 25, 2019

Introduction to the package

Procedure to use the package

Example to use the package on a data set with continuous outcomes.

Example to use the package on a data set with categorical outcomes.