---
title: "Create GTF file"
author: "L Collado-Torres"
date: "`r doc_date()`"
output: 
  BiocStyle::html_document
---

This short script creates a GTF file for chromosome 17 with features organized by __gene_id__ using the information from the UCSC hg19 knownGene database. It then creates a second GTF file where 20% of the transcripts were dropped at random.


# Complete GTF

This code constructs a complete GTF file for chromosome 17 by extracting the information from _TxDB.Hsapiens.UCSC.hg19.knownGene_. These exons can overlap one another and is a larger set than the one included in _Rsubread_ (which uses _Refseq_). It is the set closest to the information we used for generating the reads.

```{r 'maingtf'}
library('TxDb.Hsapiens.UCSC.hg19.knownGene')
library('rtracklayer')
library('devtools')
library('Rsubread')

## Load annotation info for chr17
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, 'chr17')
tx <- transcriptsBy(txdb, 'gene')

## Get the gene ids
gene_ids <- names(tx)

## Identify all exons
exons <- exons(txdb, columns = c('gene_id', 'tx_id', 'tx_name', 'exon_id'), vals = list('gene_id' = gene_ids))

## Explore exons a bit
length(exons)

## Clearly not disjoint exons
table(countOverlaps(exons) - 1)

## Note that the exon set would be a bit different if we used the data
## included in Rsubread. Notably, because they are not based on the same
## annotation: Refseq vs UCSC knownGene.
ann <- system.file("annot", "hg19_RefSeq_exon.txt", package = "Rsubread")
df <- read.table(ann, header = TRUE, stringsAsFactors = FALSE)
ex <- GRanges(subset(df, Chr == 'chr17'))

## A smaller number of total exons than we are using
length(ex)

## Some of the genes we are using are missing in the Rsubread data
sum(gene_ids %in% mcols(ex)$GeneID)
length(gene_ids)

## Create gtf file
mcols(exons)$type <- 'exon'
mcols(exons)$source <- 'hg19.UCSC.knownGene'
mcols(exons)$transcript_id <- mcols(exons)$tx_id
export(exons, 'chr17.gtf', format = 'gtf')
```

# Incomplete GTF

The following code drops 20% of the transcripts at random from the ones we used to generate the reads. Each exon has 1 or more transcript names associated to it. So, we check whether all the transcript names for a given exon are part of the 20% that we have selected to drop before dropping the exon. This results in an exon annotation set where less than 20% of the total exons are excluded.

```{r 'incomplete'}
## Get the transcript names
tx_name <- unlist(mcols(exons)$tx_name)

## Choose which to drop
load(file.path('..', 'generateReads', 'simulation_info.Rdata'))
set.seed(20151202)
tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))

## Map between transcript names and exons
map <- rep(seq_len(length(exons)), elementLengths(mcols(exons)$tx_name))

## Define incomplete exon set
exons_inc <- exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))]

## Percent of remaining exons and number of remaining exons
length(exons_inc) / length(exons) * 100
length(exons_inc)

## Code for checking whether the incomplete exon set is within what we expected
nExons <- sapply(seq_len(100), function(x) {
    tx_drop <- sample(chosen$ucsckg_id, size = round(nrow(chosen) * 0.2, 0))
    length(exons[!as.vector(tapply(tx_name %in% tx_drop, map, all))])
})
summary(nExons) / length(exons) * 100
summary(nExons)


## Finally export the incomplete GTF
export(exons_inc, 'chr17-incomplete.gtf', format = 'gtf')
save(exons_inc, tx_drop, file = 'incompleteExons.Rdata')
```


# Reproducibility

```{r 'repro'}
## Reproducibility info
Sys.time()
proc.time()
options(width = 120)
session_info()
```