--- title: "Create GTF file" author: "L Collado-Torres" date: "`r doc_date()`" output: BiocStyle::html_document --- This short script creates a GTF file for chromosome 17 with features organized by __gene_id__ using the information from the UCSC hg19 knownGene database. It then creates a second GTF file where 20% of the transcripts were dropped at random. # Complete GTF This code constructs a complete GTF file for chromosome 17 by extracting the information from _TxDB.Hsapiens.UCSC.hg19.knownGene_. These exons can overlap one another and is a larger set than the one included in _Rsubread_ (which uses _Refseq_). It is the set closest to the information we used for generating the reads. ```{r 'maingtf'} library('TxDb.Hsapiens.UCSC.hg19.knownGene') library('rtracklayer') library('devtools') library('Rsubread') ## Load annotation info for chr17 txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr17') tx <- transcriptsBy(txdb, 'gene') ## Get the gene ids gene_ids <- names(tx) ## Identify all exons exons <- exons(txdb, columns = c('gene_id', 'tx_id', 'tx_name', 'exon_id'), vals = list('gene_id' = gene_ids)) ## Explore exons a bit length(exons) ## Clearly not disjoint exons table(countOverlaps(exons) - 1) ## Note that the exon set would be a bit different if we used the data ## included in Rsubread. Notably, because they are not based on the same ## annotation: Refseq vs UCSC knownGene. ann <- system.file("annot", "hg19_RefSeq_exon.txt", package = "Rsubread") df <- read.table(ann, header = TRUE, stringsAsFactors = FALSE) ex <- GRanges(subset(df, Chr == 'chr17')) ## A smaller number of total exons than we are using length(ex) ## Some of the genes we are using are missing in the Rsubread data sum(gene_ids %in% mcols(ex)$GeneID) length(gene_ids) ## Create gtf file mcols(exons)$type <- 'exon' mcols(exons)$source <- 'hg19.UCSC.knownGene' mcols(exons)$transcript_id <- mcols(exons)$tx_id export(exons, 'chr17.gtf', format = 'gtf') ``` # Incomplete GTF The following code drops 20% of the transcripts at random from the ones we used to generate the reads. Each exon has 1 or more transcript names associated to it. So, we check whether all the transcript names for a given exon are part of the 20% that we have selected to drop before dropping the exon. This results in an exon annotation set where less than 20% of the total exons are excluded. ```{r 'incomplete'} ## Get the transcript names tx_name <- unlist(mcols(exons)$tx_name) ## Choose which to drop load(file.path('..', 'generateReads', 'simulation_info.Rdata')) set.seed(20151202) tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0)) ## Map between transcript names and exons map <- rep(seq_len(length(exons)), elementLengths(mcols(exons)$tx_name)) ## Define incomplete exon set exons_inc <- exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))] ## Percent of remaining exons and number of remaining exons length(exons_inc) / length(exons) * 100 length(exons_inc) ## Code for checking whether the incomplete exon set is within what we expected nExons <- sapply(seq_len(100), function(x) { tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0)) length(exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))]) }) summary(nExons) / length(exons) * 100 summary(nExons) ## Finally export the incomplete GTF export(exons_inc, 'chr17-incomplete.gtf', format = 'gtf') save(exons_inc, tx_drop, file = 'incompleteExons.Rdata') ``` # Reproducibility ```{r 'repro'} ## Reproducibility info Sys.time() proc.time() options(width = 120) session_info() ```