R/coverage_matrix.R
coverage_matrix.Rd
Given a set of genomic regions as created by expressed_regions, this function computes the coverage matrix for a library size of 40 million 100 bp reads for a given SRA study.
coverage_matrix(
project,
chr,
regions,
chunksize = 1000,
bpparam = NULL,
outdir = NULL,
chrlen = NULL,
verbose = TRUE,
verboseLoad = verbose,
scale = TRUE,
round = FALSE,
...
)
A character vector with one SRA study id.
A character vector with the name of the chromosome.
A GRanges-class object with regions
for chr
for which to calculate the coverage matrix.
A single integer vector defining the chunksize to use for
computing the coverage matrix. Regions will be split into different chunks
which can be useful when using a parallel instance as defined by
bpparam
.
A BiocParallelParam-class instance which will be used to calculate the coverage matrix in parallel. By default, SerialParam-class will be used.
The destination directory for the downloaded file(s) that were
previously downloaded with download_study. If the files are missing,
but outdir
is specified, they will get downloaded first. By default
outdir
is set to NULL
which will use the data from the web.
We only recommend downloading the full data if you will use it several times.
The chromosome length in base pairs. If it's NULL
, the
chromosome length is extracted from the Rail-RNA runs GitHub repository.
Alternatively check the SciServer
section on the vignette to see
how to access all the recount data via a R Jupyter Notebook.
If TRUE
basic status updates will be printed along the
way.
If TRUE
basic status updates for loading the data
will be printed.
If TRUE
, the coverage counts will be scaled to read
counts based on a library size of 40 million reads. Set scale
to
FALSE
if you want the raw coverage counts. The scaling method is by
AUC, as in the default option of scale_counts.
If TRUE
, the counts are rounded to integers. Set to
TRUE
if you want to match the defaults of scale_counts.
Additional arguments passed to download_study when
outdir
is specified but the required files are missing.
A RangedSummarizedExperiment-class object with the counts stored in the assays slot.
When using outdir = NULL
the information will be accessed
from the web on the fly. If you encounter internet access problems, it might
be best to first download the BigWig files using download_study. This
might be the best option if you are accessing all chromosomes for a given
project and/or are thinking of using different sets of regions
(for
example, from different cutoffs applied to expressed_regions).
Alternatively check the SciServer
section on the vignette to see
how to access all the recount data via a R Jupyter Notebook.
If you have bwtool
installed, you can use
https://github.com/LieberInstitute/recount.bwtool for faster results.
Note that you will need to run scale_counts after running
coverage_matrix_bwtool()
.
## Workaround for https://github.com/lawremi/rtracklayer/issues/83
download_study("SRP002001", type = "mean")
#> 2024-12-12 22:14:49.834038 downloading file mean_SRP002001.bw to SRP002001/bw
download_study("SRP002001", type = "samples")
#> 2024-12-12 22:14:51.14582 downloading file SRR036661.bw to SRP002001/bw
## Define expressed regions for study DRP002835, chrY
regions <- expressed_regions("SRP002001", "chrY",
cutoff = 5L,
maxClusterGap = 3000L,
outdir = "SRP002001"
)
#> 2024-12-12 22:14:52.089015 loadCoverage: loading BigWig file SRP002001/bw/mean_SRP002001.bw
#> 2024-12-12 22:14:52.173846 loadCoverage: applying the cutoff to the merged data
#> 2024-12-12 22:14:52.188612 filterData: originally there were 57227415 rows, now there are 57227415 rows. Meaning that 0 percent was filtered.
#> 2024-12-12 22:14:52.19185 findRegions: identifying potential segments
#> 2024-12-12 22:14:52.196581 findRegions: segmenting information
#> 2024-12-12 22:14:52.19707 .getSegmentsRle: segmenting with cutoff(s) 5
#> 2024-12-12 22:14:52.219663 findRegions: identifying candidate regions
#> 2024-12-12 22:14:53.121789 findRegions: identifying region clusters
## Now calculate the coverage matrix for this study
rse <- coverage_matrix("SRP002001", "chrY", regions, outdir = "SRP002001")
#> 2024-12-12 22:14:53.50165 downloading file SRP002001.tsv to SRP002001
#> 2024-12-12 22:14:54.100011 railMatrix: processing regions 1 to 744
#> 2024-12-12 22:14:54.110613 railMatrix: processing file SRP002001/bw/SRR036661.bw
## One row per region
identical(length(regions), nrow(rse))
#> [1] TRUE