This report shows the time and memory used to run derfinder for single base resolution differential expression analysis. It also shows the same information for going from BAM files to being ready to run DESeq (Anders and Huber, 2010), using samtools (Unknown, 2015) to convert to SAM format and HTSeq (Unknown, 2015) to make the count tables. Furthermore, this process was compared to using the summarizeOverlaps() function from the GenomicRanges (Lawrence, Huber, Pagès, Aboyoun, et al., 2013) package as well as using the coverageToExon() function included in the derfinder package (which requires the output from the fullCov step).
The following plots show the wall time and memory used by each job while taking into account the number of cores used by each job. Note that doing so is a crude approximation of how much time and memory each job would have needed had it run on a single node.
First, points are colored by the actual step.
Second, points are colored by the analysis type they belong to. Note that the loading data step is required for the single-base and expressed-regions DER approaches as well as exon counting (with derfinder).
The following plots show the wall time and memory used, but do not take into account how many cores were used.
The following plots are similar to those from the previous section. The difference is that the size of the points is determined by the number of cores the job used.
The following plot splits the data by row panels which are determined by the number of cores used.
The following plots show, per analysis, the maximum memory used by a job and the maximum wall time for that step. This assumes that all jobs for a given step ran simultaneously, for example, that all jobs running derfinder::analyzeChr() were running at the same time. Note that some analyses rely on the same steps, like loading the data (fullCov).
Below are similar plots showing the peak memory by core usage instead of the actual peak memory. This takes into account the number of cores used to run each job.
The full table is shown below. It can be useful for finding the peak number of cores (the sum of cores for all jobs running simultaneously) for a given analysis step.
memByCore (GB) | walltime (hours) | memG (GB) | peakCores | step | experiment | analysis |
---|---|---|---|---|---|---|
4.87 | 9.1136 | 4.87 | 1 | derM | brainspan | Single-base DER |
2.38 | 183.7903 | 86.43 | 510 | derA | brainspan | Single-base DER |
0.69 | 0.0047 | 0.69 | 1 | derMod | brainspan | Single-base DER |
17.35 | 2.7008 | 173.50 | 10 | fullCov | brainspan | Single-base DER |
0.95 | 0.0122 | 0.95 | 1 | derM | simulation | Single-base DER |
1.03 | 0.0553 | 1.03 | 7 | derA | simulation | Single-base DER |
0.75 | 0.0069 | 0.75 | 1 | derMod | simulation | Single-base DER |
0.69 | 0.0375 | 6.93 | 10 | fullCov | simulation | Single-base DER |
1.32 | 0.0492 | 1.32 | 1 | derM | hippo | Single-base DER |
3.90 | 0.9697 | 7.80 | 48 | derA | hippo | Single-base DER |
3.25 | 0.0222 | 3.25 | 1 | derMod | hippo | Single-base DER |
1.29 | 0.1967 | 12.91 | 10 | fullCov | hippo | Single-base DER |
4.39 | 1.2494 | 4.39 | 1 | derM | snyder | Single-base DER |
5.14 | 2.3453 | 20.55 | 96 | derA | snyder | Single-base DER |
7.02 | 0.0558 | 7.02 | 2 | derMod | snyder | Single-base DER |
2.71 | 1.2539 | 27.10 | 10 | fullCov | snyder | Single-base DER |
77.92 | 7.0422 | 389.60 | 10 | regMat | brainspan | Expressed-region DER |
17.35 | 2.7008 | 173.50 | 10 | fullCov | brainspan | Expressed-region DER |
0.76 | 0.0169 | 3.80 | 10 | regMat | simulation | Expressed-region DER |
0.69 | 0.0375 | 6.93 | 10 | fullCov | simulation | Expressed-region DER |
2.07 | 0.2442 | 10.33 | 5 | regMat | hippo | Expressed-region DER |
1.29 | 0.1967 | 12.91 | 10 | fullCov | hippo | Expressed-region DER |
5.32 | 1.1131 | 26.62 | 5 | regMat | snyder | Expressed-region DER |
2.71 | 1.2539 | 27.10 | 10 | fullCov | snyder | Expressed-region DER |
149.80 | 3.4858 | 149.80 | 1 | derR | brainspan | HTML report |
31.98 | 2.1261 | 31.98 | 1 | derR | simulation | HTML report |
36.46 | 0.8094 | 36.46 | 1 | derR | hippo | HTML report |
37.20 | 0.4836 | 37.20 | 1 | derR | snyder | HTML report |
17.35 | 2.7008 | 173.50 | 10 | fullCov | brainspan | Exon count - derfinder |
198.61 | 3.7794 | 198.61 | 2 | covToEx | brainspan | Exon count - derfinder |
0.69 | 0.0375 | 6.93 | 10 | fullCov | simulation | Exon count - derfinder |
7.13 | 0.7556 | 7.13 | 2 | covToEx | simulation | Exon count - derfinder |
1.29 | 0.1967 | 12.91 | 10 | fullCov | hippo | Exon count - derfinder |
11.16 | 0.6286 | 11.16 | 2 | covToEx | hippo | Exon count - derfinder |
2.71 | 1.2539 | 27.10 | 10 | fullCov | snyder | Exon count - derfinder |
16.20 | 0.7375 | 16.20 | 2 | covToEx | snyder | Exon count - derfinder |
0.38 | 0.5672 | 0.38 | 31 | htseq | hippo | Exon count - HTSeq |
1.73 | 3.7153 | 1.73 | 1 | toSam | hippo | Exon count - HTSeq |
0.38 | 7.8933 | 0.38 | 20 | htseq | snyder | Exon count - HTSeq |
1.44 | 42.0253 | 1.44 | 1 | toSam | snyder | Exon count - HTSeq |
1.80 | 0.2967 | 43.24 | 24 | summOv | hippo | Exon count - GenomicRanges |
6.32 | 2.6850 | 63.24 | 10 | summOv | snyder | Exon count - GenomicRanges |
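The step-level rows above can be thought of as an aggregation over individual jobs. The following is a minimal sketch of that aggregation, assuming that all jobs of a step run simultaneously, that the per-core memory of a job is its memory divided by its cores, and that the peak number of cores is the sum over the jobs of the step. The job records and the `summarize_steps` helper are hypothetical and are not taken from the report.

```python
from collections import defaultdict

# Hypothetical per-job records (illustrative values, not from the report):
# wall time in hours, memory in GB, and the number of cores each job used.
jobs = [
    {"step": "stepA", "walltime": 1.5, "memG": 20.0, "cores": 5},
    {"step": "stepA", "walltime": 2.0, "memG": 30.0, "cores": 5},
    {"step": "stepB", "walltime": 0.5, "memG": 4.0, "cores": 1},
]


def summarize_steps(jobs):
    """Collapse job records into one summary per step.

    Assumes all jobs of a step run at the same time, so the step takes as
    long as its slowest job and needs as many cores as its jobs combined.
    """
    by_step = defaultdict(list)
    for job in jobs:
        by_step[job["step"]].append(job)

    summary = {}
    for step, group in by_step.items():
        summary[step] = {
            # Largest memory-per-core figure among the jobs of the step.
            "memByCore": max(j["memG"] / j["cores"] for j in group),
            # Wall time of the slowest job.
            "walltime": max(j["walltime"] for j in group),
            # Largest memory used by a single job.
            "memG": max(j["memG"] for j in group),
            # Cores in use when every job of the step runs at once.
            "peakCores": sum(j["cores"] for j in group),
        }
    return summary


step_summary = summarize_steps(jobs)
```

With these made-up jobs, `stepA` ends up with a wall time of 2.0 hours, 30.0 GB of memory, and 10 peak cores.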
We can further summarize the resources used by each analysis by identifying the maximum memory used in the steps required for a particular analysis and the total wall time for running all the steps when all the jobs of a particular step are running simultaneously. This gives us the total actual wall time to run a specific analysis and the maximum memory required.
Below are similar plots showing the peak memory by core instead of the actual peak memory for a given job.
The table below shows the final summary. Note that in some analyses, the peak memory is from the fullCov step. We did not focus on reducing the memory load of this step as we sacrificed memory for speed. We know that much lower memory limits can be achieved using 1 core instead of the 10 cores used.
memByCore (GB) | walltime (hours) | memG (GB) | peakCores | experiment | analysis |
---|---|---|---|---|---|
17.35 | 195.609 | 173.50 | 510 | brainspan | Single-base DER |
1.03 | 0.112 | 6.93 | 10 | simulation | Single-base DER |
3.90 | 1.238 | 12.91 | 48 | hippo | Single-base DER |
7.02 | 4.904 | 27.10 | 96 | snyder | Single-base DER |
77.92 | 9.743 | 389.60 | 10 | brainspan | Expressed-region DER |
0.76 | 0.054 | 6.93 | 10 | simulation | Expressed-region DER |
2.07 | 0.441 | 12.91 | 10 | hippo | Expressed-region DER |
5.32 | 2.367 | 27.10 | 10 | snyder | Expressed-region DER |
149.80 | 3.486 | 149.80 | 1 | brainspan | HTML report |
31.98 | 2.126 | 31.98 | 1 | simulation | HTML report |
36.46 | 0.809 | 36.46 | 1 | hippo | HTML report |
37.20 | 0.484 | 37.20 | 1 | snyder | HTML report |
198.61 | 6.480 | 198.61 | 10 | brainspan | Exon count - derfinder |
7.13 | 0.793 | 7.13 | 10 | simulation | Exon count - derfinder |
11.16 | 0.825 | 12.91 | 10 | hippo | Exon count - derfinder |
16.20 | 1.991 | 27.10 | 10 | snyder | Exon count - derfinder |
1.73 | 4.283 | 1.73 | 31 | hippo | Exon count - HTSeq |
1.44 | 49.919 | 1.44 | 20 | snyder | Exon count - HTSeq |
1.80 | 0.297 | 43.24 | 24 | hippo | Exon count - GenomicRanges |
6.32 | 2.685 | 63.24 | 10 | snyder | Exon count - GenomicRanges |
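As an illustration of how a summary row relates to the step-level table, the sketch below recomputes the hippo Single-base DER row from its four step rows. It assumes the summary sums the wall time across steps and takes the maximum of the remaining numeric columns, which agrees with the values reported for this row. The `summarize_analysis` helper is hypothetical.

```python
# Step-level rows for the hippo "Single-base DER" analysis, copied from the
# step table: (memByCore, walltime in hours, memG in GB, peakCores, step).
hippo_single_base = [
    (1.32, 0.0492, 1.32, 1, "derM"),
    (3.90, 0.9697, 7.80, 48, "derA"),
    (3.25, 0.0222, 3.25, 1, "derMod"),
    (1.29, 0.1967, 12.91, 10, "fullCov"),
]


def summarize_analysis(rows):
    """Summarize one analysis from its step rows.

    Steps run one after another, so wall times add up, while memory and
    cores are limited by the most demanding step.
    """
    return {
        "memByCore": max(r[0] for r in rows),
        "walltime": round(sum(r[1] for r in rows), 3),
        "memG": max(r[2] for r in rows),
        "peakCores": max(r[3] for r in rows),
    }


hippo_summary = summarize_analysis(hippo_single_base)
```

The result matches the hippo Single-base DER row of the summary table: 3.90 GB per core, 1.238 hours, 12.91 GB, and 48 peak cores.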
Regarding the high memory load for the HTML report, this could be significantly lowered by only loading the required coverage data used for the plots instead of the full output from the fullCov step. Other improvements could be made to the plotting functions, in particular derfinderPlot::plotCluster(), that would help reduce the peak memory.
The previous table can also be used to compare the sum of the time and peak memory used by the different steps to obtain the exon count table with the following software options.
derfinder: includes resources used for reading coverage data in R and then creating a feature count matrix. We did so for
HTSeq: includes resources used for generating sorted SAM files and then running HTSeq.
summOv: resources used for running GenomicRanges::summarizeOverlaps() directly on the BAM files.
The following table shows the details of the resources used by the different jobs. It shows the experiment (experiment), the analysis step (step), wall time used (shown in hours, walltime), number of cores used (cores), memory in GB used (memG), software used (software), analysis for which the step is used (analysis), and the job name (jobib). Furthermore, it shows two simple approximations:
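As a rough sketch of two per-core approximations of this kind, the code below assumes that memory is divided by the number of cores and that wall time is multiplied by the number of cores. The memory assumption agrees with the fullCov rows in the step table (173.50 GB over 10 cores gives 17.35 GB per core). The wall time formula is only an assumption about how a single-node estimate could be formed. The function names are hypothetical.

```python
def mem_by_core(mem_gb, cores):
    """Approximate memory per core: total memory (GB) divided by cores."""
    return mem_gb / cores


def walltime_by_core(walltime_hours, cores):
    """Approximate single-core wall time (assumed formula): hours times cores."""
    return walltime_hours * cores


# fullCov step for brainspan, from the step table: 173.50 GB, 2.7008 hours,
# and 10 cores.
fullcov_mem_per_core = mem_by_core(173.50, 10)
fullcov_single_core_hours = walltime_by_core(2.7008, 10)
```

Both figures are crude: memory is rarely split evenly across cores, and parallel speedup is rarely linear.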
These are the following analysis steps:
regionReport.
regionMatrix().
GenomicRanges::summarizeOverlaps() to generate exon count table.
derfinder::coverageToExon() for UCSC hg19 knownGene or GRCh37 p11 Ensembl annotation table.
Table made using rCharts (Vaidyanathan, 2013).
Date the report was generated.
## [1] "2015-03-30 22:39:33 EDT"
Wallclock time spent generating the report.
## Time difference of 17.223 secs
R session information.
## setting value
## version R Under development (unstable) (2014-11-01 r66923)
## system x86_64, darwin10.8.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## package * version date source
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.0)
## bitops 1.0.6 2013-08-17 CRAN (R 3.2.0)
## chron 2.3.45 2014-02-11 CRAN (R 3.2.0)
## colorspace 1.2.6 2015-03-11 CRAN (R 3.2.0)
## data.table * 1.9.4 2014-10-02 CRAN (R 3.2.0)
## devtools 1.6.1 2014-10-07 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## evaluate 0.5.5 2014-04-29 CRAN (R 3.2.0)
## formatR 1.0 2014-08-25 CRAN (R 3.2.0)
## ggplot2 * 1.0.0 2014-05-21 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## httr 0.5 2014-09-02 CRAN (R 3.2.0)
## knitcitations * 1.0.4 2014-11-03 Github (cboettig/knitcitations@508de74)
## knitr * 1.7 2014-10-13 CRAN (R 3.2.0)
## knitrBootstrap 1.0.0 2014-11-03 Github (jimhester/knitrBootstrap@76c41f0)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lattice 0.20.30 2015-02-22 CRAN (R 3.2.0)
## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.0)
## markdown 0.7.4 2014-08-24 CRAN (R 3.2.0)
## MASS 7.3.40 2015-03-21 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## mime 0.3 2015-03-29 CRAN (R 3.2.0)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## plyr 1.8.1 2014-02-26 CRAN (R 3.2.0)
## proto 0.3.10 2012-12-22 CRAN (R 3.2.0)
## rCharts * 0.4.5 2014-12-17 Github (ramnathv/rCharts@929875d)
## RColorBrewer 1.1.2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.11.5 2015-03-06 CRAN (R 3.2.0)
## RCurl 1.95.4.5 2014-12-28 CRAN (R 3.2.0)
## RefManageR 0.8.40 2014-10-29 CRAN (R 3.2.0)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
## RJSONIO 1.3.0 2014-07-28 CRAN (R 3.2.0)
## rmarkdown * 0.3.3 2014-09-17 CRAN (R 3.2.0)
## rstudioapi 0.2 2014-12-31 CRAN (R 3.2.0)
## scales 0.2.4 2014-04-22 CRAN (R 3.2.0)
## stringr 0.6.2 2012-12-06 CRAN (R 3.2.0)
## whisker 0.3.2 2013-04-28 CRAN (R 3.2.0)
## XML 3.98.1.1 2013-06-20 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
This report was generated using knitrBootstrap (Hester, 2014) with knitr (Xie, 2014) and rmarkdown (Allaire, McPherson, Xie, Wickham, et al., 2014) running behind the scenes. Timing information was extracted from the SGE reports using efficiency analytics (Frazee, 2014). Figures and citations were made using ggplot2 (Wickham, 2009) and knitcitations (Boettiger, 2015), respectively.
[1] J. Allaire, J. McPherson, Y. Xie, H. Wickham, et al. rmarkdown: Dynamic Documents for R. R package version 0.3.3. 2014. URL: http://CRAN.R-project.org/package=rmarkdown.
[2] S. Anders and W. Huber. “Differential expression analysis for sequence count data”. In: Genome Biology 11 (2010), p. R106. DOI: 10.1186/gb-2010-11-10-r106. URL: http://genomebiology.com/2010/11/10/R106/.
[3] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.4. 2015. URL: https://github.com/cboettig/knitcitations.
[4] A. Frazee. Efficiency analysis of Sun Grid Engine batch jobs. 2014. URL: http://dx.doi.org/10.6084/m9.figshare.878000.
[5] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 1.0.0. 2014. URL: https://github.com/jimhester/.
[6] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118.
[7] Unknown. Unknown. http://samtools.sourceforge.net/. Accessed 2015-03-30. 2015. URL: http://samtools.sourceforge.net/.
[8] Unknown. Unknown. http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html. Accessed 2015-03-30. 2015. URL: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html.
[9] R. Vaidyanathan. rCharts: Interactive Charts using Javascript Visualization Libraries. R package version 0.4.5. 2013.
[10] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.
[11] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.