This short script creates a GTF file for chromosome 17 with features organized by gene_id using the information from the UCSC hg19 knownGene database. It then creates a second GTF file where 20% of the transcripts were dropped at random.
This code constructs a complete GTF file for chromosome 17 by extracting the information from TxDB.Hsapiens.UCSC.hg19.knownGene. These exons can overlap one another and is a larger set than the one included in Rsubread (which uses Refseq). It is the set closest to the information we used for generating the reads.
library('TxDb.Hsapiens.UCSC.hg19.knownGene')
## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist, unsplit
##
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library('rtracklayer')
library('devtools')
library('Rsubread')
## Load annotation info for chr17
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr17')
tx <- transcriptsBy(txdb, 'gene')
## Get the gene ids
gene_ids <- names(tx)
## Identify all exons
exons <- exons(txdb, columns = c('gene_id', 'tx_id', 'tx_name', 'exon_id'), vals = list('gene_id' = gene_ids))
## Explore exons a bit
length(exons)
## [1] 15033
## Clearly not disjoint exons
table(countOverlaps(exons) - 1)
##
## 0 1 2 3 4 5 6 8
## 11795 2568 504 108 39 16 2 1
## Note that the exon set would be a bit different if we used the data
## included in Rsubread. Notably, because they are not based on the same
## annotation: Refseq vs UCSC knownGene.
ann <- system.file("annot", "hg19_RefSeq_exon.txt", package = "Rsubread")
df <- read.table(ann, header = TRUE, stringsAsFactors = FALSE)
ex <- GRanges(subset(df, Chr == 'chr17'))
## A smaller number of total exons than we are using
length(ex)
## [1] 12902
## Some of the genes we are using are missing in the Rsubread data
sum(gene_ids %in% mcols(ex)$GeneID)
## [1] 1291
length(gene_ids)
## [1] 1358
## Create gtf file
mcols(exons)$type <- 'exon'
mcols(exons)$source <- 'hg19.UCSC.knownGene'
mcols(exons)$transcript_id <- mcols(exons)$tx_id
export(exons, 'chr17.gtf', format = 'gtf')
The following code drops 20% of the transcripts at random from the ones we used to generate the reads. Each exon has 1 or more transcript names associated to it. So, we check whether all the transcript names for a given exon are part of the 20% that we have selected to drop before dropping the exon. This results in an exon annotation set where less than 20% of the total exons are excluded.
## Get the transcript names
tx_name <- unlist(mcols(exons)$tx_name)
## Choose which to drop
load(file.path('..', 'generateReads', 'simulation_info.Rdata'))
set.seed(20151202)
tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))
## Map between transcript names and exons
map <- rep(seq_len(length(exons)), elementLengths(mcols(exons)$tx_name))
## Define incomplete exon set
exons_inc <- exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))]
## Percent of remaining exons and number of remaining exons
length(exons_inc) / length(exons) * 100
## [1] 91.71822
length(exons_inc)
## [1] 13788
## Code for checking whether the incomplete exon set is within what we expected
nExons <- sapply(seq_len(100), function(x) {
tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))
length(exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))])
})
summary(nExons) / length(exons) * 100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 90.00200 91.06632 91.53196 91.46544 91.79804 92.66281
summary(nExons)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13530 13690 13760 13750 13800 13930
## Finally export the incomplete GTF
export(exons_inc, 'chr17-incomplete.gtf', format = 'gtf')
save(exons_inc, tx_drop, file = 'incompleteExons.Rdata')
## Reproducibility info
Sys.time()
## [1] "2015-12-02 19:02:56 EST"
proc.time()
## user system elapsed
## 22.834 0.721 24.134
options(width = 120)
session_info()
## Session info -----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2015-12-02
## Packages ---------------------------------------------------------------------------------------------------------------
## package * version date source
## AnnotationDbi * 1.32.0 2015-10-14 Bioconductor
## Biobase * 2.30.0 2015-10-14 Bioconductor
## BiocGenerics * 0.16.1 2015-11-06 Bioconductor
## BiocParallel 1.4.0 2015-10-14 Bioconductor
## BiocStyle * 1.8.0 2015-10-14 Bioconductor
## biomaRt 2.26.1 2015-11-23 Bioconductor
## Biostrings 2.38.2 2015-11-21 Bioconductor
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
## futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.0)
## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.0)
## GenomeInfoDb * 1.6.1 2015-11-03 Bioconductor
## GenomicAlignments 1.6.1 2015-10-22 Bioconductor
## GenomicFeatures * 1.22.5 2015-11-21 Bioconductor
## GenomicRanges * 1.22.1 2015-11-06 Bioconductor
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## IRanges * 2.4.4 2015-11-21 Bioconductor
## knitr 1.11 2015-08-14 CRAN (R 3.2.2)
## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.1)
## rmarkdown * 0.8.1 2015-10-10 CRAN (R 3.2.2)
## Rsamtools 1.22.0 2015-10-14 Bioconductor
## RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
## Rsubread * 1.20.2 2015-11-06 Bioconductor
## rtracklayer * 1.30.1 2015-10-22 Bioconductor
## S4Vectors * 0.8.3 2015-11-18 Bioconductor
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## SummarizedExperiment 1.0.1 2015-11-06 Bioconductor
## TxDb.Hsapiens.UCSC.hg19.knownGene * 3.2.2 2015-10-15 Bioconductor
## XML 3.98-1.3 2015-06-30 CRAN (R 3.2.0)
## XVector 0.10.0 2015-10-14 Bioconductor
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
## zlibbioc 1.16.0 2015-10-14 Bioconductor