This page describes the supplementary material for the derfinder software paper. All the bash, R and R Markdown source files used to analyze the data for this project and to generate the HTML reports are available on this website. However, they are easier to browse at github.com/leekgroup/derSupplement.

1 BrainSpan data set

This section of the website describes the code and reports associated with the BrainSpan data set that are referred to in the paper and Supplementary Methods and Results.

1.1 Code to reproduce analyses

There are 9 main bash scripts, named step1-* through step9-*, for running the expressed regions-level and single base-level approaches.

fullCoverage loads the data from the raw files. See step1-fullCoverage.sh and step1-fullCoverage.R.
makeModels creates the models used for the single base-level analysis. See step2-makeModels.sh and step2-makeModels.R.
analyzeChr runs the single base-level analysis by chromosome. See step3-analyzeChr.sh and step3-analyzeChr.R.
mergeResults merges the single base-level analysis results for all the chromosomes. See step4-mergeResults.sh.
derfinderReport generates an HTML report for the single base-level DERs. See step5-derfinderReport.sh.
regionMatrix identifies the expressed regions for the expressed-regions level approach. See step6-regionMatrix.sh.
regMatVsDERs creates a simple HTML report comparing the single base-level and expressed regions-level approaches. See step7-regMatVsDERs.sh and step7-regMatVsDERs.Rmd.
coverageToExon creates an exon count table using known annotation information. See step8-coverageToExon.sh and step8-coverageToExon.R.
summaryInfo creates an HTML report with brief summary information for the given experiment. See step9-summaryInfo.sh, step9-summaryInfo.R, and step9-summaryInfo.Rmd.

A final bash script, run-all.sh, can be used to run the main 9 steps (or a subset of them).

Each script begins with a note showing how it was used. Some of them generate small intermediate bash scripts, for example one script per chromosome for the analyzeChr step. Some steps have a companion R or R Markdown file, either because the code is more involved or because that step generates an HTML file.
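As an illustration of the per-chromosome pattern, here is a minimal sketch of how a wrapper could write one small SGE job script per chromosome. It is not the code from step3-analyzeChr.sh: the file names, the job-name pattern, the shortened chromosome list and the `--chr` argument passed to the R script are all assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: write one small SGE job script per chromosome for the analyzeChr step.
# File names, job names and the Rscript arguments are hypothetical.
set -euo pipefail

outdir="$(mktemp -d)"          # scratch directory for the generated scripts
chromosomes=(1 2 3 X Y)        # shortened list for illustration

for chr in "${chromosomes[@]}"; do
    script="${outdir}/analyzeChr-chr${chr}.sh"
    # The heredoc delimiter is unquoted so ${chr} expands when the file is
    # written; \$ keeps the SGE directive prefix "#$" literal.
    cat > "${script}" <<EOF
#!/bin/bash
#\$ -cwd
#\$ -N analyzeChr-chr${chr}
echo "**** Job starts ****"
date
Rscript step3-analyzeChr.R --chr "chr${chr}"
echo "**** Job ends ****"
date
EOF
    chmod +x "${script}"
done

# List the generated job scripts
ls "${outdir}"
```

Each generated file could then be submitted on its own, so a failed chromosome can be rerun without touching the others.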

The check-analysis-time.R script was useful for checking the progress of the step3-analyzeChr jobs and for detecting when a node in the cluster was having problems.

We expect that these scripts will be useful to derfinder users who want to automate the single base-level and/or expressed regions-level analyses for several data sets and/or have the jobs run automatically without having to check if each step has finished running.

Note that all bash scripts are tailored to the cluster we have access to, which administers job queues with Sun Grid Engine (SGE).
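One way to have steps run automatically on SGE, without checking whether the previous one has finished, is to chain submissions with the `-hold_jid` option of `qsub`. The sketch below only illustrates that idea and is not taken from run-all.sh: the `submit_chain` function, the `DRY_RUN` switch and the use of the step label as the job name are assumptions, and by default it only prints the commands it would submit.

```shell
#!/usr/bin/env bash
# Sketch: submit a subset of pipeline steps so each waits for the previous one.
# Step labels follow the step1-* ... step9-* naming used on this page; the job
# names and the DRY_RUN switch are hypothetical.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the qsub commands
steps=(step1-fullCoverage step2-makeModels step3-analyzeChr step4-mergeResults)

submit_chain() {
    local previous="" step cmd
    for step in "$@"; do
        cmd=(qsub -N "${step}")
        # -hold_jid keeps this job queued until the named job has finished
        if [ -n "${previous}" ]; then
            cmd+=(-hold_jid "${previous}")
        fi
        cmd+=("${step}.sh")
        if [ "${DRY_RUN}" = "1" ]; then
            echo "${cmd[*]}"
        else
            "${cmd[@]}"
        fi
        previous="${step}"
    done
}

submit_chain "${steps[@]}"
```

Passing a shorter list of step labels to `submit_chain` runs only that subset. This is a simplification: a step that fans out into one job per chromosome would need the next step to wait on all of those jobs.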

1.2 Single base-level

1.2.1 Quick overview HTML report

This HTML report contains basic information on the derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015) results from the BrainSpan data set. The report answers basic questions such as:

What is the number of filtered bases?
What is the number of candidate regions?
How many candidate regions are significant?

It also illustrates three clusters of candidate differentially expressed regions (DERs) from the single base-level analysis. You can view the report by following this link:

BrainSpan

1.2.2 CSV files and annotation comparison

This HTML report has the code for loading the R data files and generating the CSV files. The report also has Venn diagrams showing the number of candidate DERs from the single base-level analysis that overlap known exons, introns and intergenic regions using the UCSC hg19 annotation. It also includes a detailed description of the columns in the CSV file.

View the venn report or its R Markdown source file venn.Rmd.

1.3 Timing and memory information

This HTML report has code for reading and processing the time and memory information for each job extracted with efficiency_analytics (Frazee, 2014). The report contains a detailed description of the analysis steps and tables summarizing the maximum memory and time for each analysis step if all the jobs for that particular step were running simultaneously. Finally, there is an interactive table with the timing results.
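The per-step summary can be read as follows: if all jobs of a step run at the same time, the step takes as long as its slowest job and needs the sum of the jobs' peak memory. The sketch below shows that arithmetic with `awk` on a made-up table. The column layout, the numbers and the `summarise_steps` function are assumptions for illustration and do not reflect the actual efficiency_analytics output or the code in timing.Rmd.

```shell
#!/usr/bin/env bash
# Sketch: summarise per-job records into per-step totals, assuming all jobs
# of a step run simultaneously. Column layout and numbers are made up.
set -euo pipefail

jobs_tsv="$(mktemp)"
printf 'step\tjob\thours\tmemory_gb\n' >  "${jobs_tsv}"
printf 'analyzeChr\tchr1\t10.5\t40\n'  >> "${jobs_tsv}"
printf 'analyzeChr\tchr2\t12.0\t38\n'  >> "${jobs_tsv}"
printf 'mergeResults\tall\t1.5\t20\n'  >> "${jobs_tsv}"

summarise_steps() {
    # Wall time of a step = slowest job; memory = sum over simultaneous jobs.
    awk -F'\t' '
        NR == 1 { next }                       # skip the header row
        {
            if (!($1 in hours) || $3 + 0 > hours[$1]) hours[$1] = $3 + 0
            mem[$1] += $4
        }
        END {
            for (s in hours) printf "%s\t%.1f\t%d\n", s, hours[s], mem[s]
        }
    ' "$1" | sort
}

# Columns printed: step, wall-time hours, total memory in GB
summarise_steps "${jobs_tsv}"
```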

View the timing report or check the R Markdown file timing.Rmd.

2 GTEx analysis

The script mergeInfo.R takes several phenotype tables and merges them into a single one. This information is then used by the select_samples.R script for choosing the 24 samples to analyze. These samples have a RIN greater than 7 and are from subjects that have samples from the heart (left ventricle), testis and liver. The script create_meanCov.R creates a mean coverage BigWig file just as you would get from using Rail-RNA (Nellore, Collado-Torres, Jaffe, Alquicira-Hernández, et al., 2015) on only these 24 samples. The actual scripts for running Rail-RNA on the GTEx data are described in the nellore/runs GitHub repository. The scripts run-railMatrix.sh and railMatrix.R then run railMatrix() using derfinder version 1.5.19 to identify the expressed regions. The resulting set of regions is then analyzed with the analyze_gtex.R script.
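The selection criteria can be illustrated with a small two-pass filter. This is a sketch in `awk` rather than the R code in select_samples.R, and it rests on assumptions: the table layout, the sample and subject IDs, and the tissue labels are toy values, and a subject is kept only if all three of its tissues pass the RIN filter. How the final 24 samples were picked among the candidates is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: keep samples with RIN > 7 from subjects that have all three tissues.
# The table layout, IDs and tissue labels are hypothetical toy data.
set -euo pipefail

pheno="$(mktemp)"
cat > "${pheno}" <<'EOF'
sample,subject,tissue,rin
s1,A,heart,8.1
s2,A,testis,7.5
s3,A,liver,7.9
s4,B,heart,8.0
s5,B,liver,6.5
s6,B,testis,7.2
s7,C,heart,9.0
EOF

select_samples() {
    # The same file is read twice.
    # Pass 1: count distinct qualifying tissues per subject.
    # Pass 2: print qualifying samples from subjects with all three tissues.
    awk -F',' '
        FNR == 1 { next }                                  # header row
        $4 + 0 <= 7 { next }                               # RIN filter
        $3 != "heart" && $3 != "testis" && $3 != "liver" { next }
        NR == FNR {
            if (!(($2, $3) in seen)) { seen[$2, $3] = 1; n[$2]++ }
            next
        }
        n[$2] == 3 { print $1 }
    ' "$1" "$1"
}

select_samples "${pheno}"
```

In the toy table only subject A has heart, testis and liver samples that all pass the RIN filter, so only that subject's samples are printed.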

3 Simulation

3.1 Generating reads

The code for generating the simulated RNA-seq reads and the chosen setup is described in the generateReads report. This report is generated by the R Markdown generateReads.Rmd file.

3.2 Processing reads

We analyzed the simulated reads with the following pipelines:

Align with HISAT (Kim, Langmead, and Salzberg, 2015), summarize with Rsubread::featureCounts() at the exon-level with and without the complete annotation, identify differentially expressed exons with DESeq2 (Love, Huber, and Anders, 2014) or edgeR-robust (Zhou, Lindsay, and Robinson, 2014).
Align with HISAT, summarize transcripts with StringTie (Pertea, Pertea, Antonescu, Chang, et al., 2015), and test at the transcript and exon levels with ballgown.
Align with HISAT, summarize with derfinder::regionMatrix(), and test with DESeq2 or edgeR-robust.
Align with Rail-RNA (Nellore, Collado-Torres, Jaffe, Alquicira-Hernández, et al., 2015), summarize with derfinder::railMatrix(), and test with DESeq2 or edgeR-robust.

Here we list the role of different scripts.

The code for aligning the reads to the genome with HISAT is in the run-paired-hisat.sh script, while the code for aligning with Rail-RNA is in the prep-manifest.R and run-rail.sh scripts.
createGTF creates GTF files with the complete and incomplete annotation (source createGTF.Rmd).
The scripts run-featCounts.sh, featureCounts.R and run-featCounts-inc.sh, featureCounts-inc.R run Rsubread::featureCounts() at the exon level with the complete and incomplete annotation respectively.
Similarly, the scripts run-stringtie.sh and run-stringtie-inc.sh run StringTie with the complete and incomplete annotation creating the input needed to run ballgown.
The scripts run_ballgown_analysis.sh and ballgown_analysis.R then perform the ballgown (Frazee, Pertea, Jaffe, Langmead, et al., 2015) analyses at the transcript and exon levels.
The scripts run_regionMat.sh and regionMat.R run derfinder::regionMatrix() with the HISAT output.
Similarly, the scripts run_railMat.sh and railMat.R run derfinder::railMatrix() with the Rail-RNA output.
The scripts run_calc_stats.sh and calc_stats.R use the count matrices created by regionMatrix(), railMatrix() and featureCounts() to perform differential expression tests using DESeq2 and edgeR-robust.

3.3 Evaluating results

The report evaluate (source evaluate.Rmd) defines different reference sets one could consider. It then takes the results from all the different pipelines and compares them against these reference sets. The report includes summary tables from these results showing the minimum and maximum empirical power, false positive rate and false discovery rate. The main results are highlighted in the paper. Finally, timing (source timing.Rmd) shows information about the timing and computer resources used by the different pipelines for the simulation analysis.
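For reference, the three summary metrics have the following standard definitions, where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives relative to a given reference set. How each feature is classified against a reference set is specific to the evaluate report and is not reproduced here.

```latex
\begin{aligned}
\text{empirical power} &= \frac{TP}{TP + FN} \\
\text{false positive rate} &= \frac{FP}{FP + TN} \\
\text{false discovery rate} &= \frac{FP}{FP + TP}
\end{aligned}
```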

4 Miscellaneous

4.1 Expressed regions-level overview figure

The code used for generating the panels used in the figure showing the expressed regions-level approach is available in the figure-expressed-regions.R file.

4.2 Single base-level overview figure

The code used for generating the panels used in the figure showing the single base-level approach is available in the figure-single-base.R file.

4.3 Additional analyses

The following R source files have the code for reproducing additional analyses described in the paper:

analyze_brainspan.R and brainspan_regionLevel.R contain the analysis of BrainSpan expressed regions-level DERs.
characterize_brainspan_DERs.R contains the analysis of BrainSpan single base-level DERs.

These scripts also include other exploratory code.

5 Reproducibility

Date this page was generated.

## [1] "2016-03-21 10:08:47 EDT"

Wallclock time spent generating the report.

## Time difference of 1.351 secs

R session information.

## Session info -----------------------------------------------------------------------------------------------------------

##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2016-03-21

## Packages ---------------------------------------------------------------------------------------------------------------

##  package       * version  date       source        
##  bibtex          0.4.0    2014-12-31 CRAN (R 3.2.0)
##  BiocStyle     * 1.8.0    2015-10-14 Bioconductor  
##  bitops          1.0-6    2013-08-17 CRAN (R 3.2.0)
##  devtools        1.10.0   2016-01-23 CRAN (R 3.2.3)
##  digest          0.6.9    2016-01-08 CRAN (R 3.2.3)
##  evaluate        0.8      2015-09-18 CRAN (R 3.2.0)
##  formatR         1.2.1    2015-09-18 CRAN (R 3.2.0)
##  htmltools       0.3      2015-12-29 CRAN (R 3.2.3)
##  httr            1.1.0    2016-01-28 CRAN (R 3.2.3)
##  knitcitations * 1.0.7    2015-10-28 CRAN (R 3.2.0)
##  knitr           1.12.3   2016-01-22 CRAN (R 3.2.3)
##  lubridate       1.5.0    2015-12-03 CRAN (R 3.2.3)
##  magrittr        1.5      2014-11-22 CRAN (R 3.2.0)
##  memoise         1.0.0    2016-01-29 CRAN (R 3.2.3)
##  plyr            1.8.3    2015-06-12 CRAN (R 3.2.1)
##  R6              2.1.2    2016-01-26 CRAN (R 3.2.3)
##  Rcpp            0.12.3   2016-01-10 CRAN (R 3.2.3)
##  RCurl           1.95-4.7 2015-06-30 CRAN (R 3.2.1)
##  RefManageR      0.10.6   2016-02-15 CRAN (R 3.2.3)
##  RJSONIO         1.3-0    2014-07-28 CRAN (R 3.2.0)
##  rmarkdown     * 0.9.2    2016-01-01 CRAN (R 3.2.3)
##  stringi         1.0-1    2015-10-22 CRAN (R 3.2.0)
##  stringr         1.0.0    2015-04-30 CRAN (R 3.2.0)
##  XML             3.98-1.3 2015-06-30 CRAN (R 3.2.0)
##  yaml            2.1.13   2014-06-12 CRAN (R 3.2.0)

You can view the source R Markdown file for this page at index.Rmd.

6 Bibliography

This report was generated using BiocStyle (Morgan, Oleś, and Huber, 2016) with knitr (Xie, 2014) and rmarkdown (Allaire, Cheng, Xie, McPherson, et al., 2016) running behind the scenes.

Citations were made with knitcitations (Boettiger, 2015). Citation file: index.bib.

[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.9.2. 2016. URL: http://CRAN.R-project.org/package=rmarkdown.

[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.

[3] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.

[4] A. Frazee. Efficiency analysis of Sun Grid Engine batch jobs. 2014. URL: http://dx.doi.org/10.6084/m9.figshare.878000.

[5] A. C. Frazee, G. Pertea, A. E. Jaffe, B. Langmead, et al. “Ballgown bridges the gap between transcriptome assembly and expression analysis”. In: Nature Biotechnology (2015).

[6] D. Kim, B. Langmead and S. L. Salzberg. “HISAT: a fast spliced aligner with low memory requirements”. In: Nature Methods (2015).

[7] M. I. Love, W. Huber and S. Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”. In: Genome Biology 15.12 (2014), p. 550. DOI: 10.1186/s13059-014-0550-8.

[8] M. Morgan, A. Oleś and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 1.8.0. 2016. URL: https://github.com/Bioconductor/BiocStyle.

[9] A. Nellore, L. Collado-Torres, A. E. Jaffe, J. Alquicira-Hernández, et al. “Rail-RNA: Scalable analysis of RNA-seq splicing and coverage”. In: bioRxiv (2015).

[10] M. Pertea, G. M. Pertea, C. M. Antonescu, T. Chang, et al. “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads”. In: Nature Biotechnology (2015).

[11] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.

[12] X. Zhou, H. Lindsay and M. D. Robinson. “Robustly detecting differential expression in RNA sequencing data using observation weights”. In: Nucleic Acids Research 42 (2014), p. e91.

derfinder software paper Supplementary Website

L Collado-Torres

21 March 2016
