This report shows the time and memory used to all the simulation pipelines. These include running derfinder
via regionMatrix()
and railMatrix()
from HISAT
(Kim, Langmead, and Salzberg, 2015) and Rail-RNA
(Nellore, Collado-Torres, Jaffe, Alquicira-Hernández, et al., 2015) output respectively and performing the statistical tests with DESeq2
(Love, Huber, and Anders, 2014) and edgeR
-robust . We also used ballgown
(Frazee, Pertea, Jaffe, Langmead, et al., 2015) for exon and transcript level tests using data summarized by StringTie
(Pertea, Pertea, Antonescu, Chang, et al., 2015) from the HISAT
alignments.
## Extract information from Gmail
system('cp ../../efficiency_analytics/client_secrets .')
system('python ../../efficiency_analytics/analyze_efficiency.py --email fellgernon@gmail.com --folder "Cluster/derSupp" --outfile timing-derSupp.txt')
## Load libraries
library("ggplot2")
library("knitr")
## Setup
## Read data and process it
all <- read.table('timing-derSupp.txt', header = TRUE, stringsAsFactors = FALSE)
all$software <- NA
all$software[grepl('^bg-', all$jobid)] <- 'Ballgown'
all$software[grepl('stats-', all$jobid)] <- 'DESeq2 & edgeR'
all$software[grepl('hisat', all$jobid)] <- 'HISAT'
all$software[grepl('make-railMat', all$jobid)] <- 'derfinder'
all$software[grepl('make-regMat', all$jobid)] <- 'derfinder'
all$software[grepl('rail-prep', all$jobid)] <- 'Rail-RNA'
all$software[grepl('rail-align', all$jobid)] <- 'Rail-RNA'
all$software[grepl('bg-no-assembly', all$jobid)] <- 'StringTie'
all$software[grepl('featCounts', all$jobid)] <- 'featureCounts'
all$software[grepl('genReads', all$jobid)] <- 'polyester'
## Remove unused data
all <- all[!is.na(all$software), ]
## Add replicate info
all$Replicate <- NA
all$Replicate[grepl('R1', all$jobid)] <- '1'
all$Replicate[grepl('R2', all$jobid)] <- '2'
all$Replicate[grepl('R3', all$jobid)] <- '3'
## Add memory info
all$memG <- all$memory
all$memG[all$memunit == "M"] <- all$memG[all$memunit == "M"] / 1024
## Cores info
all$cores <- 1L
all$cores[all$software %in% c('featureCounts', 'StringTie', 'HISAT')] <- 4L
all$cores[grepl('make-regMat', all$jobid)] <- 4L
all$cores[all$software == 'Rail-RNA'] <- 10L
all$timeByCore <- all$walltime * all$cores
all$memByCore <- all$memG / all$cores
## Types of analysis
all$analysis <- factor(ifelse(all$software %in% c('HISAT'), 'Align', ifelse(all$software %in% c('StringTie', 'featureCounts', 'derfinder'), 'Summarize', ifelse(all$software %in% c('Ballgown', 'DESeq2 & edgeR'), 'Statistical tests', ifelse(all$software == 'polyester', 'Simulate reads', ifelse(grepl('rail-prep', all$jobid) & all$software == 'Rail-RNA', 'Align prep', 'Align'))))))
all$experiment <- 'Simulation'
The following plots show the wall time and memory used by each job while taking into account the number of cores used by each job. Note that doing so is a crude approximation of how much time and memory each job would have needed had it ran on a single node.
Points are colored by the software used with shapes given by the analysis step.
## Walltime and memory adjusted by number of cores (it's an approximation)
ggplot(all, aes(x=timeByCore, y=memByCore, colour=software, shape=analysis)) + geom_point(size = 3) + xlab("Wall time (hrs) multiplied by the number of cores") + ylab("Memory (GB) divided by the number of cores") + scale_colour_brewer(palette="Dark2") + theme_bw(base_size = 18) + theme(legend.position=c(.65, .35))
time <- ggplot(all, aes(x=log2(timeByCore), y=memByCore, colour=software, shape=analysis)) + geom_point(size = 3) + xlab("Wall time (hrs) multiplied by the number of cores (log2)") + ylab("Memory (GB) divided by the number of cores") + scale_colour_brewer(palette="Dark2") + theme_bw(base_size = 18) + theme(legend.position=c(.25, .65))
## For supp text
time
pdf(file = 'time.pdf', width = 10)
time
dev.off()
## quartz_off_screen
## 2
#system('open time.pdf')
getInfo <- function(df, sumTime = FALSE, peakCores = FALSE) {
memByCore <- max(df$memByCore)
walltime <- ifelse(sumTime, sum(df$walltime), max(df$walltime)) * 60
memG <- max(df$memG)
peakCores <- ifelse(peakCores, max(df$peakCores), sum(df$cores))
res <- c(memByCore = memByCore, walltime = walltime, memG = memG, peakCores = peakCores)
return(res)
}
analysisInfo <- list(
'Expressed-regions (Rail-RNA)' = grepl('make-railMat|stats-railMatrix', all$jobid) | all$software == 'Rail-RNA',
'Expressed-regions (HISAT)' = grepl('make-regMat|stats-regionMatrix', all$jobid) | all$software == 'HISAT',
'Feature counts' = grepl('stats-featCount', all$jobid) | all$software %in% c('HISAT', 'featureCounts'),
'Ballgown' = all$software %in% c('HISAT', 'Ballgown', 'StringTie')
)
## Summarize the information for each step of each analysis
analysisSummary <- lapply(names(analysisInfo), function(pipeline) {
current <- all[analysisInfo[[pipeline]], ]
res_pipeline <- lapply(c('1', '2', '3'), function(rep) {
use <- subset(current, Replicate == rep)
if(nrow(use) == 0) return(NULL)
res_rep <- lapply(unique(use$analysis), function(analysis) {
res_step <- as.data.frame(t(getInfo(use[use$analysis == analysis, ], sumTime = TRUE)))
res_step$analysis <- analysis
res_step$Replicate <- rep
res_step$Pipeline <- pipeline
return(res_step)
})
res_rep <- do.call(rbind, res_rep)
return(res_rep)
})
res_pipeline <- do.call(rbind, res_pipeline)
return(res_pipeline)
})
analysisSummary <- do.call(rbind, analysisSummary)
The table shown below shows per analysis the maximum memory used by a job and the total wall time (in minutes) for that step assuming jobs ran sequentially. This table can be useful to find the peak number of cores (the sum of cores for all jobs running simultaneously) for a given analysis step.
kable(analysisSummary, format = 'markdown', digits = c(2, 2, 2))
memByCore | walltime | memG | peakCores | analysis | Replicate | Pipeline |
---|---|---|---|---|---|---|
0.62 | 1.52 | 0.62 | 1 | Statistical tests | 1 | Expressed-regions (Rail-RNA) |
1.41 | 1.55 | 1.41 | 1 | Summarize | 1 | Expressed-regions (Rail-RNA) |
32.77 | 142.88 | 327.71 | 10 | Align | 1 | Expressed-regions (Rail-RNA) |
3.00 | 3.23 | 29.97 | 10 | Align prep | 1 | Expressed-regions (Rail-RNA) |
0.62 | 1.33 | 0.62 | 1 | Statistical tests | 2 | Expressed-regions (Rail-RNA) |
1.41 | 1.50 | 1.41 | 1 | Summarize | 2 | Expressed-regions (Rail-RNA) |
34.82 | 218.40 | 348.18 | 10 | Align | 2 | Expressed-regions (Rail-RNA) |
3.06 | 3.18 | 30.62 | 10 | Align prep | 2 | Expressed-regions (Rail-RNA) |
0.62 | 1.88 | 0.62 | 1 | Statistical tests | 3 | Expressed-regions (Rail-RNA) |
1.41 | 1.48 | 1.41 | 1 | Summarize | 3 | Expressed-regions (Rail-RNA) |
39.06 | 137.55 | 390.60 | 10 | Align | 3 | Expressed-regions (Rail-RNA) |
2.82 | 2.15 | 28.20 | 10 | Align prep | 3 | Expressed-regions (Rail-RNA) |
0.61 | 1.52 | 0.61 | 1 | Statistical tests | 1 | Expressed-regions (HISAT) |
0.79 | 5.02 | 3.16 | 4 | Summarize | 1 | Expressed-regions (HISAT) |
3.22 | 72.15 | 12.88 | 40 | Align | 1 | Expressed-regions (HISAT) |
0.60 | 1.88 | 0.60 | 1 | Statistical tests | 2 | Expressed-regions (HISAT) |
0.79 | 7.02 | 3.18 | 4 | Summarize | 2 | Expressed-regions (HISAT) |
3.22 | 47.98 | 12.88 | 40 | Align | 2 | Expressed-regions (HISAT) |
0.61 | 1.62 | 0.61 | 1 | Statistical tests | 3 | Expressed-regions (HISAT) |
0.79 | 6.88 | 3.15 | 4 | Summarize | 3 | Expressed-regions (HISAT) |
3.22 | 47.22 | 12.88 | 40 | Align | 3 | Expressed-regions (HISAT) |
0.62 | 5.28 | 0.62 | 2 | Statistical tests | 1 | Feature counts |
2.21 | 1.60 | 8.84 | 8 | Summarize | 1 | Feature counts |
3.22 | 72.15 | 12.88 | 40 | Align | 1 | Feature counts |
0.61 | 3.68 | 0.61 | 2 | Statistical tests | 2 | Feature counts |
2.21 | 1.58 | 8.84 | 8 | Summarize | 2 | Feature counts |
3.22 | 47.98 | 12.88 | 40 | Align | 2 | Feature counts |
0.62 | 3.70 | 0.62 | 2 | Statistical tests | 3 | Feature counts |
2.21 | 1.57 | 8.85 | 8 | Summarize | 3 | Feature counts |
3.22 | 47.22 | 12.88 | 40 | Align | 3 | Feature counts |
0.71 | 0.77 | 0.71 | 2 | Statistical tests | 1 | Ballgown |
2.11 | 11.02 | 8.42 | 80 | Summarize | 1 | Ballgown |
3.22 | 72.15 | 12.88 | 40 | Align | 1 | Ballgown |
0.70 | 0.67 | 0.70 | 2 | Statistical tests | 2 | Ballgown |
2.10 | 9.22 | 8.41 | 68 | Summarize | 2 | Ballgown |
3.22 | 47.98 | 12.88 | 40 | Align | 2 | Ballgown |
0.71 | 0.67 | 0.71 | 2 | Statistical tests | 3 | Ballgown |
2.11 | 8.65 | 8.43 | 60 | Summarize | 3 | Ballgown |
3.22 | 47.22 | 12.88 | 40 | Align | 3 | Ballgown |
getRange_helper <- function(x) {
x <- round(range(x), digits = 1)
paste0('(', x[1], '-', x[2], ')')
}
getRange <- function(df) {
data.frame(
memByCore = getRange_helper(df$memByCore),
walltime = getRange_helper(df$walltime),
memG = getRange_helper(df$memG),
peakCores = max(df$peakCores)
)
}
## Summary the information for each analysis
peaks <- lapply(names(analysisInfo), function(pipeline) {
res_pipeline <- lapply(unique(analysisSummary$analysis), function(analysis) {
current <- analysisSummary[analysisSummary$analysis == analysis & analysisSummary$Pipeline == pipeline, ]
if(nrow(current) == 0) return(NULL)
res_analysis <- getRange(current)
res_analysis$Pipeline <- pipeline
res_analysis$Analysis <- analysis
return(res_analysis)
})
res_pipeline <- do.call(rbind, res_pipeline)
return(res_pipeline)
})
peaks <- do.call(rbind, peaks)
peaks$Pipeline[peaks$Pipeline == 'Ballgown' & peaks$Analysis == 'Align'] <- 'HISAT: ERs, FeatureCounts, Ballgown'
peaks <- peaks[-which(peaks$Pipeline %in% c('Expressed-regions (HISAT)', 'Feature counts') & peaks$Analysis == 'Align'), ]
save(peaks, file = 'peaks.Rdata')
We can further summarize the resources used by each analysis by identified the maximum memory used in the steps required for a particular analysis and the total wall time (in minutes) for running all the steps when all the jobs of a particular step are running simultaneously. Thus giving us the total actual wall time to run a specific analysis and the maximum memory required.
The table below shows the final summary with the range (minimum, maximum) for the three simulation replicates.
kable(peaks, format = 'markdown', row.names = FALSE)
memByCore | walltime | memG | peakCores | Pipeline | Analysis |
---|---|---|---|---|---|
(0.6-0.6) | (1.3-1.9) | (0.6-0.6) | 1 | Expressed-regions (Rail-RNA) | Statistical tests |
(1.4-1.4) | (1.5-1.5) | (1.4-1.4) | 1 | Expressed-regions (Rail-RNA) | Summarize |
(32.8-39.1) | (137.6-218.4) | (327.7-390.6) | 10 | Expressed-regions (Rail-RNA) | Align |
(2.8-3.1) | (2.1-3.2) | (28.2-30.6) | 10 | Expressed-regions (Rail-RNA) | Align prep |
(0.6-0.6) | (1.5-1.9) | (0.6-0.6) | 1 | Expressed-regions (HISAT) | Statistical tests |
(0.8-0.8) | (5-7) | (3.2-3.2) | 4 | Expressed-regions (HISAT) | Summarize |
(0.6-0.6) | (3.7-5.3) | (0.6-0.6) | 2 | Feature counts | Statistical tests |
(2.2-2.2) | (1.6-1.6) | (8.8-8.9) | 8 | Feature counts | Summarize |
(0.7-0.7) | (0.7-0.8) | (0.7-0.7) | 2 | Ballgown | Statistical tests |
(2.1-2.1) | (8.7-11) | (8.4-8.4) | 80 | Ballgown | Summarize |
(3.2-3.2) | (47.2-72.1) | (12.9-12.9) | 40 | HISAT: ERs, FeatureCounts, Ballgown | Align |
The following table shows the details of the resources used by the different jobs. It shows the analysis step (analysis), the simulation replicate (Replicate), wall time used (shown in minutes, walltime), number of cores used (cores), memory in GB used (memG), software used (software) and the job name (jobib). Furthermore, it shows two simple approximations:
library("DT")
## Print whole table
all$walltime <- all$walltime * 60
d <- all[, c("analysis", "Replicate", "walltime", "cores", "memG", "timeByCore", "memByCore", "software", "analysis", "jobid")]
datatable(d, options = list(pagingType='full_numbers', pageLength=50, scrollX='100%')) %>% formatRound(columns = c(3, 5:7), digits = 3)
Table made using DT
(Xie, 2015).
Date the report was generated.
## [1] "2016-03-21 15:39:47 EDT"
Wallclock time spent generating the report.
## Time difference of 4.142 secs
R
session information.
## Session info -----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2016-03-21
## Packages ---------------------------------------------------------------------------------------------------------------
## package * version date source
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.0)
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
## digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
## DT * 0.1 2015-06-09 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
## ggplot2 * 2.0.0 2015-12-18 CRAN (R 3.2.3)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## highr 0.5.1 2015-09-18 CRAN (R 3.2.0)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## htmlwidgets 0.5 2015-06-21 CRAN (R 3.2.1)
## httr 1.1.0 2016-01-28 CRAN (R 3.2.3)
## jsonlite 0.9.19 2015-11-28 CRAN (R 3.2.2)
## knitcitations * 1.0.7 2015-10-28 CRAN (R 3.2.0)
## knitr * 1.12.3 2016-01-22 CRAN (R 3.2.3)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lubridate 1.5.0 2015-12-03 CRAN (R 3.2.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
## munsell 0.4.3 2016-02-13 CRAN (R 3.2.3)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.1)
## R6 2.1.2 2016-01-26 CRAN (R 3.2.3)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
## RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.1)
## RefManageR 0.10.6 2016-02-15 CRAN (R 3.2.3)
## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.0)
## rmarkdown * 0.9.5 2016-02-22 CRAN (R 3.2.3)
## scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## XML 3.98-1.3 2015-06-30 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
This report was generated using rmarkdown
(Allaire, Cheng, Xie, McPherson, et al., 2016) with knitr
(Xie, 2014) running behind the scenes. Timing information extracted from the SGE reports using efficiency analytics
(Frazee, 2014). Figures and citations were made using ggplot2
(Wickham, 2009) and knitcitations
(Boettiger, 2015) respectively.
Citation file: timing.bib
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.9.5. 2016. URL: http://CRAN.R-project.org/package=rmarkdown.
[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[3] A. Frazee. Efficiency analysis of Sun Grid Engine batch jobs. 2014. URL: http://dx.doi.org/10.6084/m9.figshare.878000.
[4] A. C. Frazee, G. Pertea, A. E. Jaffe, B. Langmead, et al. “Ballgown bridges the gap between transcriptome assembly and expression analysis”. In: Nature Biotechnology (2015).
[5] D. Kim, B. Langmead and S. L. Salzberg. “HISAT: a fast spliced aligner with low memory requirements”. In: Nature Methods (2015).
[6] M. I. Love, W. Huber and S. Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”. In: Genome Biology 15 (12 2014), p. 550. DOI: 10.1186/s13059-014-0550-8.
[7] A. Nellore, L. Collado-Torres, A. E. Jaffe, J. Alquicira-Hernández, et al. “Rail-RNA: Scalable analysis of RNA-seq splicing and coverage”. In: bioRxiv (2015).
[8] M. Pertea, G. M. Pertea, C. M. Antonescu, T. Chang, et al. “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads”. In: Nature Biotechnology (2015).
[9] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.
[10] Y. Xie. DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.1. 2015. URL: http://CRAN.R-project.org/package=DT.
[11] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.