This report shows the time and memory used to all the simulation pipelines. These include running derfinder via regionMatrix() and railMatrix() from HISAT (Kim, Langmead, and Salzberg, 2015) and Rail-RNA (Nellore, Collado-Torres, Jaffe, Alquicira-Hernández, et al., 2015) output respectively and performing the statistical tests with DESeq2 (Love, Huber, and Anders, 2014) and edgeR-robust . We also used ballgown (Frazee, Pertea, Jaffe, Langmead, et al., 2015) for exon and transcript level tests using data summarized by StringTie (Pertea, Pertea, Antonescu, Chang, et al., 2015) from the HISAT alignments.

Results

## Extract information from Gmail
system('cp ../../efficiency_analytics/client_secrets .')
system('python ../../efficiency_analytics/analyze_efficiency.py --email fellgernon@gmail.com --folder "Cluster/derSupp" --outfile timing-derSupp.txt')
## Load libraries
library("ggplot2")
library("knitr")
## Setup

## Read data and process it
all <- read.table('timing-derSupp.txt', header = TRUE, stringsAsFactors = FALSE)
all$software <- NA
all$software[grepl('^bg-', all$jobid)] <- 'Ballgown'
all$software[grepl('stats-', all$jobid)] <- 'DESeq2 & edgeR'
all$software[grepl('hisat', all$jobid)] <- 'HISAT'
all$software[grepl('make-railMat', all$jobid)] <- 'derfinder'
all$software[grepl('make-regMat', all$jobid)] <- 'derfinder'
all$software[grepl('rail-prep', all$jobid)] <- 'Rail-RNA'
all$software[grepl('rail-align', all$jobid)] <- 'Rail-RNA'
all$software[grepl('bg-no-assembly', all$jobid)] <- 'StringTie'
all$software[grepl('featCounts', all$jobid)] <- 'featureCounts'
all$software[grepl('genReads', all$jobid)] <- 'polyester'

## Remove unused data
all <- all[!is.na(all$software), ]

## Add replicate info
all$Replicate <- NA
all$Replicate[grepl('R1', all$jobid)] <- '1'
all$Replicate[grepl('R2', all$jobid)] <- '2'
all$Replicate[grepl('R3', all$jobid)] <- '3'

## Add memory info
all$memG <- all$memory
all$memG[all$memunit == "M"] <- all$memG[all$memunit == "M"] / 1024

## Cores info
all$cores <- 1L
all$cores[all$software %in% c('featureCounts', 'StringTie', 'HISAT')] <- 4L
all$cores[grepl('make-regMat', all$jobid)] <- 4L
all$cores[all$software == 'Rail-RNA'] <- 10L


all$timeByCore <- all$walltime * all$cores
all$memByCore <- all$memG / all$cores


## Types of analysis
all$analysis <- factor(ifelse(all$software %in% c('HISAT'), 'Align', ifelse(all$software %in% c('StringTie', 'featureCounts', 'derfinder'), 'Summarize', ifelse(all$software %in% c('Ballgown', 'DESeq2 & edgeR'), 'Statistical tests', ifelse(all$software == 'polyester', 'Simulate reads', ifelse(grepl('rail-prep', all$jobid) & all$software == 'Rail-RNA', 'Align prep', 'Align'))))))

all$experiment <- 'Simulation'

Adjusting by number of cores

The following plots show the wall time and memory used by each job while taking into account the number of cores used by each job. Note that doing so is a crude approximation of how much time and memory each job would have needed had it ran on a single node.

Points are colored by the software used with shapes given by the analysis step.

## Walltime and memory adjusted by number of cores (it's an approximation)
ggplot(all, aes(x=timeByCore, y=memByCore, colour=software, shape=analysis)) + geom_point(size = 3) + xlab("Wall time (hrs) multiplied by the number of cores") + ylab("Memory (GB) divided by the number of cores") + scale_colour_brewer(palette="Dark2") + theme_bw(base_size = 18) + theme(legend.position=c(.65, .35))

time <- ggplot(all, aes(x=log2(timeByCore), y=memByCore, colour=software, shape=analysis)) + geom_point(size = 3) + xlab("Wall time (hrs) multiplied by the number of cores (log2)") + ylab("Memory (GB) divided by the number of cores") + scale_colour_brewer(palette="Dark2") + theme_bw(base_size = 18) + theme(legend.position=c(.25, .65))

## For supp text
time

pdf(file = 'time.pdf', width = 10)
time
dev.off()
## quartz_off_screen 
##                 2
#system('open time.pdf')

Resources by step for each analysis

getInfo <- function(df, sumTime = FALSE, peakCores = FALSE) {
    memByCore <- max(df$memByCore)
    walltime <- ifelse(sumTime, sum(df$walltime), max(df$walltime)) * 60
    memG <- max(df$memG)
    peakCores <- ifelse(peakCores, max(df$peakCores), sum(df$cores))
    res <- c(memByCore = memByCore, walltime = walltime, memG = memG, peakCores = peakCores)
    return(res)
}

analysisInfo <- list(
    'Expressed-regions (Rail-RNA)' = grepl('make-railMat|stats-railMatrix', all$jobid) | all$software == 'Rail-RNA',
    'Expressed-regions (HISAT)' = grepl('make-regMat|stats-regionMatrix', all$jobid) | all$software == 'HISAT',
    'Feature counts' = grepl('stats-featCount', all$jobid) | all$software %in% c('HISAT', 'featureCounts'),
    'Ballgown' = all$software %in% c('HISAT', 'Ballgown', 'StringTie')
)

## Summarize the information for each step of each analysis
analysisSummary <- lapply(names(analysisInfo), function(pipeline) {
    current <- all[analysisInfo[[pipeline]], ]
    res_pipeline <- lapply(c('1', '2', '3'), function(rep) {
        use <- subset(current, Replicate == rep)
        if(nrow(use) == 0) return(NULL)
        res_rep <- lapply(unique(use$analysis), function(analysis) {
            res_step <- as.data.frame(t(getInfo(use[use$analysis == analysis, ], sumTime = TRUE)))
            res_step$analysis <- analysis
            res_step$Replicate <- rep
            res_step$Pipeline <- pipeline
            return(res_step)
        })
        res_rep <- do.call(rbind, res_rep)
        return(res_rep)
    })
    res_pipeline <- do.call(rbind, res_pipeline)
    return(res_pipeline)
})
analysisSummary <- do.call(rbind, analysisSummary)

The table shown below shows per analysis the maximum memory used by a job and the total wall time (in minutes) for that step assuming jobs ran sequentially. This table can be useful to find the peak number of cores (the sum of cores for all jobs running simultaneously) for a given analysis step.

kable(analysisSummary, format = 'markdown', digits = c(2, 2, 2))
memByCore walltime memG peakCores analysis Replicate Pipeline
0.62 1.52 0.62 1 Statistical tests 1 Expressed-regions (Rail-RNA)
1.41 1.55 1.41 1 Summarize 1 Expressed-regions (Rail-RNA)
32.77 142.88 327.71 10 Align 1 Expressed-regions (Rail-RNA)
3.00 3.23 29.97 10 Align prep 1 Expressed-regions (Rail-RNA)
0.62 1.33 0.62 1 Statistical tests 2 Expressed-regions (Rail-RNA)
1.41 1.50 1.41 1 Summarize 2 Expressed-regions (Rail-RNA)
34.82 218.40 348.18 10 Align 2 Expressed-regions (Rail-RNA)
3.06 3.18 30.62 10 Align prep 2 Expressed-regions (Rail-RNA)
0.62 1.88 0.62 1 Statistical tests 3 Expressed-regions (Rail-RNA)
1.41 1.48 1.41 1 Summarize 3 Expressed-regions (Rail-RNA)
39.06 137.55 390.60 10 Align 3 Expressed-regions (Rail-RNA)
2.82 2.15 28.20 10 Align prep 3 Expressed-regions (Rail-RNA)
0.61 1.52 0.61 1 Statistical tests 1 Expressed-regions (HISAT)
0.79 5.02 3.16 4 Summarize 1 Expressed-regions (HISAT)
3.22 72.15 12.88 40 Align 1 Expressed-regions (HISAT)
0.60 1.88 0.60 1 Statistical tests 2 Expressed-regions (HISAT)
0.79 7.02 3.18 4 Summarize 2 Expressed-regions (HISAT)
3.22 47.98 12.88 40 Align 2 Expressed-regions (HISAT)
0.61 1.62 0.61 1 Statistical tests 3 Expressed-regions (HISAT)
0.79 6.88 3.15 4 Summarize 3 Expressed-regions (HISAT)
3.22 47.22 12.88 40 Align 3 Expressed-regions (HISAT)
0.62 5.28 0.62 2 Statistical tests 1 Feature counts
2.21 1.60 8.84 8 Summarize 1 Feature counts
3.22 72.15 12.88 40 Align 1 Feature counts
0.61 3.68 0.61 2 Statistical tests 2 Feature counts
2.21 1.58 8.84 8 Summarize 2 Feature counts
3.22 47.98 12.88 40 Align 2 Feature counts
0.62 3.70 0.62 2 Statistical tests 3 Feature counts
2.21 1.57 8.85 8 Summarize 3 Feature counts
3.22 47.22 12.88 40 Align 3 Feature counts
0.71 0.77 0.71 2 Statistical tests 1 Ballgown
2.11 11.02 8.42 80 Summarize 1 Ballgown
3.22 72.15 12.88 40 Align 1 Ballgown
0.70 0.67 0.70 2 Statistical tests 2 Ballgown
2.10 9.22 8.41 68 Summarize 2 Ballgown
3.22 47.98 12.88 40 Align 2 Ballgown
0.71 0.67 0.71 2 Statistical tests 3 Ballgown
2.11 8.65 8.43 60 Summarize 3 Ballgown
3.22 47.22 12.88 40 Align 3 Ballgown

Resources for each analysis

getRange_helper <- function(x) { 
    x <- round(range(x), digits = 1)
    paste0('(', x[1], '-', x[2], ')')
}
getRange <- function(df) {
    data.frame(
        memByCore = getRange_helper(df$memByCore),
        walltime = getRange_helper(df$walltime),
        memG = getRange_helper(df$memG),
        peakCores = max(df$peakCores)
    )
}

## Summary the information for each analysis
peaks <- lapply(names(analysisInfo), function(pipeline) {
    res_pipeline <- lapply(unique(analysisSummary$analysis), function(analysis) {
        current <- analysisSummary[analysisSummary$analysis == analysis & analysisSummary$Pipeline == pipeline, ]
        if(nrow(current) == 0) return(NULL)
        res_analysis <- getRange(current)
        res_analysis$Pipeline <- pipeline
        res_analysis$Analysis <- analysis
        return(res_analysis)
    })
    res_pipeline <- do.call(rbind, res_pipeline)
    return(res_pipeline)
})
peaks <- do.call(rbind, peaks)

peaks$Pipeline[peaks$Pipeline == 'Ballgown' & peaks$Analysis == 'Align'] <- 'HISAT: ERs, FeatureCounts, Ballgown'
peaks <- peaks[-which(peaks$Pipeline %in% c('Expressed-regions (HISAT)', 'Feature counts') & peaks$Analysis == 'Align'), ]


save(peaks, file = 'peaks.Rdata')

We can further summarize the resources used by each analysis by identified the maximum memory used in the steps required for a particular analysis and the total wall time (in minutes) for running all the steps when all the jobs of a particular step are running simultaneously. Thus giving us the total actual wall time to run a specific analysis and the maximum memory required.

The table below shows the final summary with the range (minimum, maximum) for the three simulation replicates.

kable(peaks, format = 'markdown', row.names = FALSE)
memByCore walltime memG peakCores Pipeline Analysis
(0.6-0.6) (1.3-1.9) (0.6-0.6) 1 Expressed-regions (Rail-RNA) Statistical tests
(1.4-1.4) (1.5-1.5) (1.4-1.4) 1 Expressed-regions (Rail-RNA) Summarize
(32.8-39.1) (137.6-218.4) (327.7-390.6) 10 Expressed-regions (Rail-RNA) Align
(2.8-3.1) (2.1-3.2) (28.2-30.6) 10 Expressed-regions (Rail-RNA) Align prep
(0.6-0.6) (1.5-1.9) (0.6-0.6) 1 Expressed-regions (HISAT) Statistical tests
(0.8-0.8) (5-7) (3.2-3.2) 4 Expressed-regions (HISAT) Summarize
(0.6-0.6) (3.7-5.3) (0.6-0.6) 2 Feature counts Statistical tests
(2.2-2.2) (1.6-1.6) (8.8-8.9) 8 Feature counts Summarize
(0.7-0.7) (0.7-0.8) (0.7-0.7) 2 Ballgown Statistical tests
(2.1-2.1) (8.7-11) (8.4-8.4) 80 Ballgown Summarize
(3.2-3.2) (47.2-72.1) (12.9-12.9) 40 HISAT: ERs, FeatureCounts, Ballgown Align

Details

The following table shows the details of the resources used by the different jobs. It shows the analysis step (analysis), the simulation replicate (Replicate), wall time used (shown in minutes, walltime), number of cores used (cores), memory in GB used (memG), software used (software) and the job name (jobib). Furthermore, it shows two simple approximations:

  • timeByCore is the wall time (in hours) multiplied by the number of cores used. It is a very simple approximation for the wall time used had the job been ran on a single node. This approximation is known to be false, but it gives a basic idea.
  • memByCore is the memory (in GB) divided by the number of cores used. It is an approximation for the memory used had the job been ran on a single node.
library("DT")

## Print whole table
all$walltime <- all$walltime * 60
d <- all[, c("analysis", "Replicate", "walltime", "cores", "memG", "timeByCore", "memByCore", "software", "analysis", "jobid")]
datatable(d, options = list(pagingType='full_numbers', pageLength=50, scrollX='100%')) %>% formatRound(columns = c(3, 5:7), digits = 3)

Table made using DT (Xie, 2015).

Reproducibility

Date the report was generated.

## [1] "2016-03-21 15:39:47 EDT"

Wallclock time spent generating the report.

## Time difference of 4.142 secs

R session information.

## Session info -----------------------------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2016-03-21
## Packages ---------------------------------------------------------------------------------------------------------------
##  package       * version  date       source        
##  bibtex          0.4.0    2014-12-31 CRAN (R 3.2.0)
##  bitops          1.0-6    2013-08-17 CRAN (R 3.2.0)
##  colorspace      1.2-6    2015-03-11 CRAN (R 3.2.0)
##  devtools        1.10.0   2016-01-23 CRAN (R 3.2.3)
##  digest          0.6.9    2016-01-08 CRAN (R 3.2.3)
##  DT            * 0.1      2015-06-09 CRAN (R 3.2.0)
##  evaluate        0.8      2015-09-18 CRAN (R 3.2.0)
##  formatR         1.2.1    2015-09-18 CRAN (R 3.2.0)
##  ggplot2       * 2.0.0    2015-12-18 CRAN (R 3.2.3)
##  gtable          0.1.2    2012-12-05 CRAN (R 3.2.0)
##  highr           0.5.1    2015-09-18 CRAN (R 3.2.0)
##  htmltools       0.3      2015-12-29 CRAN (R 3.2.3)
##  htmlwidgets     0.5      2015-06-21 CRAN (R 3.2.1)
##  httr            1.1.0    2016-01-28 CRAN (R 3.2.3)
##  jsonlite        0.9.19   2015-11-28 CRAN (R 3.2.2)
##  knitcitations * 1.0.7    2015-10-28 CRAN (R 3.2.0)
##  knitr         * 1.12.3   2016-01-22 CRAN (R 3.2.3)
##  labeling        0.3      2014-08-23 CRAN (R 3.2.0)
##  lubridate       1.5.0    2015-12-03 CRAN (R 3.2.3)
##  magrittr        1.5      2014-11-22 CRAN (R 3.2.0)
##  memoise         1.0.0    2016-01-29 CRAN (R 3.2.3)
##  munsell         0.4.3    2016-02-13 CRAN (R 3.2.3)
##  plyr            1.8.3    2015-06-12 CRAN (R 3.2.1)
##  R6              2.1.2    2016-01-26 CRAN (R 3.2.3)
##  RColorBrewer    1.1-2    2014-12-07 CRAN (R 3.2.0)
##  Rcpp            0.12.3   2016-01-10 CRAN (R 3.2.3)
##  RCurl           1.95-4.7 2015-06-30 CRAN (R 3.2.1)
##  RefManageR      0.10.6   2016-02-15 CRAN (R 3.2.3)
##  RJSONIO         1.3-0    2014-07-28 CRAN (R 3.2.0)
##  rmarkdown     * 0.9.5    2016-02-22 CRAN (R 3.2.3)
##  scales          0.3.0    2015-08-25 CRAN (R 3.2.0)
##  stringi         1.0-1    2015-10-22 CRAN (R 3.2.0)
##  stringr         1.0.0    2015-04-30 CRAN (R 3.2.0)
##  XML             3.98-1.3 2015-06-30 CRAN (R 3.2.0)
##  yaml            2.1.13   2014-06-12 CRAN (R 3.2.0)

Bibliography

This report was generated using rmarkdown (Allaire, Cheng, Xie, McPherson, et al., 2016) with knitr (Xie, 2014) running behind the scenes. Timing information extracted from the SGE reports using efficiency analytics (Frazee, 2014). Figures and citations were made using ggplot2 (Wickham, 2009) and knitcitations (Boettiger, 2015) respectively.

Citation file: timing.bib

[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.9.5. 2016. URL: http://CRAN.R-project.org/package=rmarkdown.

[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.

[3] A. Frazee. Efficiency analysis of Sun Grid Engine batch jobs. 2014. URL: http://dx.doi.org/10.6084/m9.figshare.878000.

[4] A. C. Frazee, G. Pertea, A. E. Jaffe, B. Langmead, et al. “Ballgown bridges the gap between transcriptome assembly and expression analysis”. In: Nature Biotechnology (2015).

[5] D. Kim, B. Langmead and S. L. Salzberg. “HISAT: a fast spliced aligner with low memory requirements”. In: Nature Methods (2015).

[6] M. I. Love, W. Huber and S. Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”. In: Genome Biology 15 (12 2014), p. 550. DOI: 10.1186/s13059-014-0550-8.

[7] A. Nellore, L. Collado-Torres, A. E. Jaffe, J. Alquicira-Hernández, et al. “Rail-RNA: Scalable analysis of RNA-seq splicing and coverage”. In: bioRxiv (2015).

[8] M. Pertea, G. M. Pertea, C. M. Antonescu, T. Chang, et al. “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads”. In: Nature Biotechnology (2015).

[9] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.

[10] Y. Xie. DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.1. 2015. URL: http://CRAN.R-project.org/package=DT.

[11] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.