Question: What metrics are best to describe the "coverage" of RNA-seq data?
Bio_X2Y wrote, 9.7 years ago:

As far as I know, DNA sequencing coverage is simply calculated as (read count x read length / genome size). This means, for example, that an experiment might be described as "40x coverage".
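As a minimal sketch of that arithmetic (the read count, read length, and genome size below are invented round numbers, chosen only to reproduce the "40x" example):

```python
# Hypothetical numbers for illustration: one billion 100 bp reads
# sequenced against a 2.5 Gb genome.
read_count = 1_000_000_000
read_length = 100
genome_size = 2_500_000_000

# DNA sequencing coverage: (read count x read length) / genome size
coverage = read_count * read_length / genome_size
print(f"{coverage:.0f}x coverage")  # -> 40x coverage
```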

For RNA-seq, my initial instinct would have been to calculate a "coverage" figure in much the same way, but dividing by the number of annotated bases rather than full genome length. When I think about it, however, this figure seems fairly meaningless for RNA-seq. From what I understand, DNA sequencing coverage is fairly uniform across the genome (telomeres and repeat regions excepted), so the simple coverage figure is informative - you can be reasonably sure that a typical base in a "40x coverage" genome will be represented by about 40 reads.

For RNA-seq, however, the variation in transcript levels surely makes this figure meaningless? A coverage of "10x" would tell you next to nothing about the number of reads supporting a typical base.

So, my question is - is there any other metric that people would use to convey the strength/coverage of their data set? Would "read count" still be the best available? Has anybody used anything like the read count associated with a housekeeping gene(s) to compare coverage?


sequencing rpkm next-gen rna data • 24k views
written 9.7 years ago by Bio_X2Y • modified 4.3 years ago
Michael Dondrup (Bergen, Norway) wrote, 9.7 years ago:

Thanks for this question.

You are correct that the intuitive interpretation of "coverage" for RNA-seq data is problematic. The reason, in my opinion, is that we do not know the size/length of the transcriptome. This matters when comparing, e.g., two different samples by RNA-seq: in theory, two samples are directly comparable with respect to coverage if and only if the combined length of all transcripts (molecules concatenated) in both samples is equal. Otherwise, the counts need to be normalized. In this context I therefore prefer to talk about "pseudo-coverage". This is especially true when comparing at the gene level.

I recommend the study by Bullard et al. [1], who compared different approaches to normalization (e.g. RPKM, upper-quartile normalization, quantile normalization, housekeeping-gene normalization). In short, they prefer upper-quartile normalization over RPKM.
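A minimal sketch of the upper-quartile idea, with an invented toy count matrix (this is an illustration of the principle, not the exact procedure in the paper): each sample is scaled by the 75th percentile of its non-zero gene counts.

```python
import numpy as np

# Toy gene-count matrix: rows = genes, columns = two samples.
# All numbers are made up; sample 2 is exactly double sample 1.
counts = np.array([
    [500, 1000],
    [ 20,   40],
    [  0,    0],   # unexpressed gene, excluded from the quartile
    [100,  200],
    [ 60,  120],
])

def upper_quartile_factors(counts):
    """Per-sample 75th percentile of non-zero gene counts,
    rescaled to a common reference (their mean)."""
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    return uq / uq.mean()

factors = upper_quartile_factors(counts)
normalized = counts / factors  # after scaling, the two samples agree
```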

One should maybe also re-phrase the question a bit, to something like: "What is the best way to accurately model RNA-seq data and its variability?" The consensus (really?) seems to be that variability in technical replicates can be modelled by a Poisson distribution, while biological replicates show overdispersion and can be modelled by a negative binomial distribution. Both R packages for RNA-seq (edgeR, DEseq) take this approach, with some differences in parameter estimation and testing.
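The Poisson-versus-overdispersion point can be illustrated by simulation (the mean and dispersion values here are arbitrary; this is not how edgeR or DEseq estimate their parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = 100  # hypothetical mean read count for one gene

# Technical replicates: Poisson, so variance == mean.
technical = rng.poisson(mean, size=100_000)

# Biological replicates: negative binomial with dispersion alpha,
# so variance = mean + alpha * mean**2 > mean (overdispersion).
alpha = 0.1
n = 1 / alpha          # NumPy parameterizes NB by (n, p);
p = n / (n + mean)     # convert from (mean, alpha)
biological = rng.negative_binomial(n, p, size=100_000)

print(technical.var())   # close to 100
print(biological.var())  # close to 100 + 0.1 * 100**2 = 1100
```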

I have collected some papers for a seminar in a Mendeley group:

You can find my slides here: Feedback is welcome...

written 9.7 years ago by Michael Dondrup

No, I am talking about DEseq, but there is also a package called DEGseq, which sometimes leads to confusion. As far as I understand, DEGseq uses an MA-plot-based method and the samr algorithm. Here is the DEGseq application note for completeness:

written 9.5 years ago by Michael Dondrup

Michael, thank you for sharing your slides. When I saw "DEseq" in your answer, I thought it was a typo. But after seeing "DEseq" again in your slides, I concluded that you had made a tiny mistake. The package name should be "DEGseq".

written 9.5 years ago by Dejian
Gww wrote, 9.7 years ago:

You can't really use a metric like that to get a general idea of your coverage, because the expression level of each gene can vary by sample.

Perhaps you could calculate the RPKM (reads per KB per million reads) for each gene and plot their distribution? Or the distribution for a subset of all the genes.
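A minimal sketch of that calculation, with invented gene lengths and read counts:

```python
# Toy RPKM calculation; the gene lengths, counts, and library size
# below are invented for illustration.
genes = {
    # gene: (exon model length in bp, mapped read count)
    "geneA": (2_000, 400),
    "geneB": (500, 100),
    "geneC": (10_000, 5_000),
}
total_mapped = 1_000_000  # total mapped reads in the library

def rpkm(length_bp, reads, total_mapped):
    # reads per kilobase of exon model per million mapped reads
    return reads / (length_bp / 1_000) / (total_mapped / 1_000_000)

values = {g: rpkm(l, r, total_mapped) for g, (l, r) in genes.items()}
# One could then plot the distribution of these values (e.g. with a
# histogram) instead of reporting a single coverage number.
```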

written 9.7 years ago by Gww

In addition, an organism with differentiated cell types would not be expected to express all genes in all tissues, so your expected sequence space will shift by tissue. Also, there is likely to be a sampling bias that varies with the laboratory protocol, e.g. the 3' bias observed with polyA-selected sequences.

written 9.7 years ago by iw9oel_ad

I think it would be really difficult to come up with a single number. There are all kinds of sources of experimental bias that could occur between samples, even ones derived from the same genome. For example, maybe one sample had significantly less rRNA contamination. You'd expect that sample to have higher coverage of everything else than the same sample with more rRNA contamination.

written 9.7 years ago by Gww

Nope, if you check the text of the paper I linked, they describe it as "reads per kilobase of exon model per million mapped reads (RPKM)".

written 9.7 years ago by Gww

Thanks GWW. At the moment, that's basically what we're doing internally. I was wondering whether there is a single "number" that could summarize the strength of an experiment in some way. E.g. if I heard that another group had sequenced the same genome as us, I'd be immediately interested in knowing which experiment has better coverage. At the moment, I'd probably just go with "read count" (assuming the same sequencing platform, etc.) to get a feel for how they stack up against each other...

written 9.7 years ago by Bio_X2Y

I thought RPKM meant "reads per kilobase mapped", not "reads per KB per million".

written 9.7 years ago by Daniel Standage

If you have a normalized library (where the coverage of each transcript in the sample is expected to be uniform), I think coverage calculated as in genome experiments could work, and could be useful for comparing the sequencing depth between different experiments.

written 9.5 years ago by Marina Manrique
Ketil wrote, 9.7 years ago:

Although there is some theory for the minimum coverage necessary for DNA assembly, in practice DNA sampling is far from uniform, and coverage doesn't really say much about the quality of the resulting genome assembly. The closest thing to genome coverage for RNA would be the total size of the reads divided by the size of the transcriptome, but as you say, the distribution is more skewed for RNA than for the genome, making it even less useful.

What exactly is it that you wish to measure? Do you wish to measure the RNA diversity in your sample, or do you wish to measure the effectiveness of sample normalization? From your description, it sounds like the latter. The distribution of transcripts tends to follow an approximate power law (Pareto), so you could perhaps fit the distribution to that and measure the slope (which will be steeper in the non-normalized case).
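A rough sketch of fitting such a slope, assuming classical Pareto-distributed abundances (the shape parameter and the maximum-likelihood estimator used here are illustrative choices, not a method prescribed in the thread):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated transcript abundances from a Pareto distribution; the
# shape parameter alpha_true is arbitrary, chosen for illustration.
alpha_true = 1.5
x_min = 1.0
abundances = x_min * (1 + rng.pareto(alpha_true, size=50_000))

# Hill / maximum-likelihood estimator of the Pareto slope:
alpha_hat = len(abundances) / np.log(abundances / x_min).sum()
print(alpha_hat)  # close to 1.5; a larger fitted slope would suggest
                  # a less skewed (more normalized) library
```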

written 9.7 years ago by Ketil
Marina Manrique wrote, 9.5 years ago:

Hi Jeremy Leipzig,

As almost everyone has said, RPKM is the most widely used metric, also for describing gene expression levels in RNA-seq experiments.

However, I would like to point out that if you have a normalized library, where coverage is expected to be uniform across your transcripts, then calculating coverage as is done in genome experiments (as you said, dividing by the number of annotated bases) could be correct.

I think the key point is whether coverage is expected to be uniform in your data (as it is supposed to be in normalized libraries) or not.

written 9.5 years ago by Marina Manrique
Larry_Parnell (Boston, MA USA) wrote, 9.7 years ago:

As GWW writes, RPKM is the metric I most often see. One does need to consider heterozygous alleles, such that one may see 100 reads across a given exon of a gene, 60 of which have an A at a certain polymorphic base and another 40 of which have a C at that position.

The coverage parameter must be calculated on a per-exon basis, because of alternative splicing.
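A minimal sketch of per-exon coverage from read intervals (the coordinates below are invented; a real pipeline would take them from a BAM file and an annotation):

```python
# Half-open genomic intervals, invented for illustration.
exons = {
    "ex1": (100, 200),
    "ex2": (300, 450),
}
reads = [(110, 160), (150, 200), (90, 140), (300, 350), (400, 450)]

def exon_coverage(exon, reads):
    """Mean per-base coverage of an exon from aligned read intervals."""
    start, end = exon
    overlap = sum(max(0, min(end, r_end) - max(start, r_start))
                  for r_start, r_end in reads)
    return overlap / (end - start)

cov = {name: exon_coverage(span, reads) for name, span in exons.items()}
```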

written 9.7 years ago by Larry_Parnell

Two different isoforms of an mRNA can still use the same exon(s) in one part of the message and alternate exons in a different part. For example, alternate exon 1, but exons 2 through 5 are all identical. Thus, your reads for exons 2 through 5 tell you nothing about which isoform was actually expressed and detected by sequencing.

written 9.5 years ago by Larry_Parnell

About calculating the coverage parameter per exon: why wouldn't you calculate it based on transcripts? Transcripts should represent isoforms, so alternative splicing would be accounted for. Anyway, I'd like to have both coverage values; I think exon-based coverage could be useful for checking the transcript assembly.

written 9.5 years ago by Marina Manrique
Darked89 (Barcelona, Spain) wrote, 9.7 years ago:

For a novel genome, a coverage metric along these lines makes sense: how many bases/putative exons does your RNA-Seq data cover at least 2x in total (after subtracting repetitive and ribosomal sequences)? A "good" data set will let you predict more transcripts.

For expression studies you can try to plot two numbers: the total length of genomic bases mapped at some threshold coverage (2x, as above) against the total length of RNA-Seq bases successfully mapped.

Things to watch for are unspliced mRNAs, which will badly mess up these numbers.
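A minimal sketch of the 2x-threshold count, assuming a per-base depth array (the depths below are invented; in practice they would come from something like `samtools depth` output):

```python
import numpy as np

# Per-base read depth over a tiny, invented reference region.
depth = np.array([0, 1, 2, 5, 3, 2, 1, 0, 4, 4])

threshold = 2
bases_covered = int((depth >= threshold).sum())
fraction = bases_covered / depth.size
print(bases_covered, fraction)  # 6 bases covered at >= 2x, i.e. 0.6
```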

written 9.7 years ago by Darked89
Powered by Biostar version 2.3.0