Given a set of regions for a chromosome, compute the coverage matrix for a given SRA study.

Given a set of genomic regions as created by expressed_regions, this function computes the coverage matrix for a library size of 40 million 100 bp reads for a given SRA study.

coverage_matrix(
  project,
  chr,
  regions,
  chunksize = 1000,
  bpparam = NULL,
  outdir = NULL,
  chrlen = NULL,
  verbose = TRUE,
  verboseLoad = verbose,
  scale = TRUE,
  round = FALSE,
  ...
)

Arguments

project: A character vector with one SRA study id.
chr: A character vector with the name of the chromosome.
regions: A GRanges-class object with regions for chr for which to calculate the coverage matrix.
chunksize: A single integer vector defining the chunksize to use for computing the coverage matrix. Regions will be split into different chunks which can be useful when using a parallel instance as defined by bpparam.
bpparam: A BiocParallelParam-class instance which will be used to calculate the coverage matrix in parallel. By default, SerialParam-class will be used.
outdir: The destination directory for the downloaded file(s) that were previously downloaded with download_study. If the files are missing, but outdir is specified, they will get downloaded first. By default outdir is set to NULL which will use the data from the web. We only recommend downloading the full data if you will use it several times.
chrlen: The chromosome length in base pairs. If it's NULL, the chromosome length is extracted from the Rail-RNA runs GitHub repository. Alternatively check the SciServer section on the vignette to see how to access all the recount data via a R Jupyter Notebook.
verbose: If TRUE basic status updates will be printed along the way.
verboseLoad: If TRUE basic status updates for loading the data will be printed.
scale: If TRUE, the coverage counts will be scaled to read counts based on a library size of 40 million reads. Set scale to FALSE if you want the raw coverage counts. The scaling method is by AUC, as in the default option of scale_counts.
round: If TRUE, the counts are rounded to integers. Set to TRUE if you want to match the defaults of scale_counts.
...: Additional arguments passed to download_study when outdir is specified but the required files are missing.

Value

A RangedSummarizedExperiment-class object with the counts stored in the assays slot.

Details

When using outdir = NULL the information will be accessed from the web on the fly. If you encounter internet access problems, it might be best to first download the BigWig files using download_study. This might be the best option if you are accessing all chromosomes for a given project and/or are thinking of using different sets of regions (for example, from different cutoffs applied to expressed_regions). Alternatively check the SciServer section on the vignette to see how to access all the recount data via a R Jupyter Notebook.

If you have bwtool installed, you can use https://github.com/LieberInstitute/recount.bwtool for faster results. Note that you will need to run scale_counts after running coverage_matrix_bwtool().

Author

Leonardo Collado-Torres

Examples


## Workaround for https://github.com/lawremi/rtracklayer/issues/83
download_study("SRP002001", type = "mean")
#> 2024-12-12 22:14:49.834038 downloading file mean_SRP002001.bw to SRP002001/bw
download_study("SRP002001", type = "samples")
#> 2024-12-12 22:14:51.14582 downloading file SRR036661.bw to SRP002001/bw

## Define expressed regions for study DRP002835, chrY
regions <- expressed_regions("SRP002001", "chrY",
    cutoff = 5L,
    maxClusterGap = 3000L,
    outdir = "SRP002001"
)
#> 2024-12-12 22:14:52.089015 loadCoverage: loading BigWig file SRP002001/bw/mean_SRP002001.bw
#> 2024-12-12 22:14:52.173846 loadCoverage: applying the cutoff to the merged data
#> 2024-12-12 22:14:52.188612 filterData: originally there were 57227415 rows, now there are 57227415 rows. Meaning that 0 percent was filtered.
#> 2024-12-12 22:14:52.19185 findRegions: identifying potential segments
#> 2024-12-12 22:14:52.196581 findRegions: segmenting information
#> 2024-12-12 22:14:52.19707 .getSegmentsRle: segmenting with cutoff(s) 5
#> 2024-12-12 22:14:52.219663 findRegions: identifying candidate regions
#> 2024-12-12 22:14:53.121789 findRegions: identifying region clusters

## Now calculate the coverage matrix for this study
rse <- coverage_matrix("SRP002001", "chrY", regions, outdir = "SRP002001")
#> 2024-12-12 22:14:53.50165 downloading file SRP002001.tsv to SRP002001
#> 2024-12-12 22:14:54.100011 railMatrix: processing regions 1 to 744
#> 2024-12-12 22:14:54.110613 railMatrix: processing file SRP002001/bw/SRR036661.bw

## One row per region
identical(length(regions), nrow(rse))
#> [1] TRUE