This short script creates a GTF file for chromosome 17 with features organized by gene_id using the information from the UCSC hg19 knownGene database. It then creates a second GTF file where 20% of the transcripts were dropped at random.

1 Complete GTF

This code constructs a complete GTF file for chromosome 17 by extracting the information from TxDB.Hsapiens.UCSC.hg19.knownGene. These exons can overlap one another and is a larger set than the one included in Rsubread (which uses Refseq). It is the set closest to the information we used for generating the reads.

library('TxDb.Hsapiens.UCSC.hg19.knownGene')

## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

library('rtracklayer')
library('devtools')
library('Rsubread')

## Load annotation info for chr17
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr17')
tx <- transcriptsBy(txdb, 'gene')

## Get the gene ids
gene_ids <- names(tx)

## Identify all exons
exons <- exons(txdb, columns = c('gene_id', 'tx_id', 'tx_name', 'exon_id'), vals = list('gene_id' = gene_ids))

## Explore exons a bit
length(exons)

## [1] 15033

## Clearly not disjoint exons
table(countOverlaps(exons) - 1)

## 
##     0     1     2     3     4     5     6     8 
## 11795  2568   504   108    39    16     2     1

## Note that the exon set would be a bit different if we used the data
## included in Rsubread. Notably, because they are not based on the same
## annotation: Refseq vs UCSC knownGene.
ann <- system.file("annot", "hg19_RefSeq_exon.txt", package = "Rsubread")
df <- read.table(ann, header = TRUE, stringsAsFactors = FALSE)
ex <- GRanges(subset(df, Chr == 'chr17'))

## A smaller number of total exons than we are using
length(ex)

## [1] 12902

## Some of the genes we are using are missing in the Rsubread data
sum(gene_ids %in% mcols(ex)$GeneID)

## [1] 1291

length(gene_ids)

## [1] 1358

## Create gtf file
mcols(exons)$type <- 'exon'
mcols(exons)$source <- 'hg19.UCSC.knownGene'
mcols(exons)$transcript_id <- mcols(exons)$tx_id
export(exons, 'chr17.gtf', format = 'gtf')

2 Incomplete GTF

The following code drops 20% of the transcripts at random from the ones we used to generate the reads. Each exon has 1 or more transcript names associated to it. So, we check whether all the transcript names for a given exon are part of the 20% that we have selected to drop before dropping the exon. This results in an exon annotation set where less than 20% of the total exons are excluded.

## Get the transcript names
tx_name <- unlist(mcols(exons)$tx_name)

## Choose which to drop
load(file.path('..', 'generateReads', 'simulation_info.Rdata'))
set.seed(20151202)
tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))

## Map between transcript names and exons
map <- rep(seq_len(length(exons)), elementLengths(mcols(exons)$tx_name))

## Define incomplete exon set
exons_inc <- exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))]

## Percent of remaining exons and number of remaining exons
length(exons_inc) / length(exons) * 100

## [1] 91.71822

length(exons_inc)

## [1] 13788

## Code for checking whether the incomplete exon set is within what we expected
nExons <- sapply(seq_len(100), function(x) {
    tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))
    length(exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))])
})
summary(nExons) / length(exons) * 100

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 90.00200 91.06632 91.53196 91.46544 91.79804 92.66281

summary(nExons)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13530   13690   13760   13750   13800   13930

## Finally export the incomplete GTF
export(exons_inc, 'chr17-incomplete.gtf', format = 'gtf')
save(exons_inc, tx_drop, file = 'incompleteExons.Rdata')

3 Reproducibility

## Reproducibility info
Sys.time()

## [1] "2015-12-02 19:02:56 EST"

proc.time()

##    user  system elapsed 
##  22.834   0.721  24.134

options(width = 120)
session_info()

## Session info -----------------------------------------------------------------------------------------------------------

##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2015-12-02

## Packages ---------------------------------------------------------------------------------------------------------------

##  package                           * version  date       source        
##  AnnotationDbi                     * 1.32.0   2015-10-14 Bioconductor  
##  Biobase                           * 2.30.0   2015-10-14 Bioconductor  
##  BiocGenerics                      * 0.16.1   2015-11-06 Bioconductor  
##  BiocParallel                        1.4.0    2015-10-14 Bioconductor  
##  BiocStyle                         * 1.8.0    2015-10-14 Bioconductor  
##  biomaRt                             2.26.1   2015-11-23 Bioconductor  
##  Biostrings                          2.38.2   2015-11-21 Bioconductor  
##  bitops                              1.0-6    2013-08-17 CRAN (R 3.2.0)
##  DBI                                 0.3.1    2014-09-24 CRAN (R 3.2.0)
##  devtools                          * 1.9.1    2015-09-11 CRAN (R 3.2.0)
##  digest                              0.6.8    2014-12-31 CRAN (R 3.2.0)
##  evaluate                            0.8      2015-09-18 CRAN (R 3.2.0)
##  formatR                             1.2.1    2015-09-18 CRAN (R 3.2.0)
##  futile.logger                       1.4.1    2015-04-20 CRAN (R 3.2.0)
##  futile.options                      1.0.0    2010-04-06 CRAN (R 3.2.0)
##  GenomeInfoDb                      * 1.6.1    2015-11-03 Bioconductor  
##  GenomicAlignments                   1.6.1    2015-10-22 Bioconductor  
##  GenomicFeatures                   * 1.22.5   2015-11-21 Bioconductor  
##  GenomicRanges                     * 1.22.1   2015-11-06 Bioconductor  
##  htmltools                           0.2.6    2014-09-08 CRAN (R 3.2.0)
##  IRanges                           * 2.4.4    2015-11-21 Bioconductor  
##  knitr                               1.11     2015-08-14 CRAN (R 3.2.2)
##  lambda.r                            1.1.7    2015-03-20 CRAN (R 3.2.0)
##  magrittr                            1.5      2014-11-22 CRAN (R 3.2.0)
##  memoise                             0.2.1    2014-04-22 CRAN (R 3.2.0)
##  RCurl                               1.95-4.7 2015-06-30 CRAN (R 3.2.1)
##  rmarkdown                         * 0.8.1    2015-10-10 CRAN (R 3.2.2)
##  Rsamtools                           1.22.0   2015-10-14 Bioconductor  
##  RSQLite                             1.0.0    2014-10-25 CRAN (R 3.2.0)
##  Rsubread                          * 1.20.2   2015-11-06 Bioconductor  
##  rtracklayer                       * 1.30.1   2015-10-22 Bioconductor  
##  S4Vectors                         * 0.8.3    2015-11-18 Bioconductor  
##  stringi                             1.0-1    2015-10-22 CRAN (R 3.2.0)
##  stringr                             1.0.0    2015-04-30 CRAN (R 3.2.0)
##  SummarizedExperiment                1.0.1    2015-11-06 Bioconductor  
##  TxDb.Hsapiens.UCSC.hg19.knownGene * 3.2.2    2015-10-15 Bioconductor  
##  XML                                 3.98-1.3 2015-06-30 CRAN (R 3.2.0)
##  XVector                             0.10.0   2015-10-14 Bioconductor  
##  yaml                                2.1.13   2014-06-12 CRAN (R 3.2.0)
##  zlibbioc                            1.16.0   2015-10-14 Bioconductor

Create GTF file

L Collado-Torres

2 December 2015

Contents

1 Complete GTF

2 Incomplete GTF

3 Reproducibility