Question

Reads vs library size for DEA

21 months ago

backfish • 0

Hey,

for robust differential expression analysis (DEA) many factors need to be considered (number of replicates, genome complexity, etc.). One important parameter is the number of reads per sample. But what does that actually refer to? The number of sequenced reads per sample, or all reads remaining after QC and mapping, which I refer to as library size?

I think many estimates refer to the sequencing depth (5-30 million reads) instead of the final library size. But I would be interested in the library size required for DEA (DESeq2 or limma).

For example: 3 biological replicates (mouse) and two conditions with each sample having a final library size of ~1 million reads. Would that be sufficient for DEA?

Many thanks for your ideas, Flo

DEA RNAseq • 549 views

updated 21 months ago by Papyrus ★ 2.9k • written 21 months ago by backfish • 0

score 1 · Answer 1 · 2022-07-29

There's not really a specific threshold for the number of reads needed to be "able" to do the analyses. The number of reads simply determines your power (how many genes you detect, the size of the changes you are able to measure, etc.).
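To make the link between read number and power a bit more concrete, here is a minimal sketch of how many reads a gene can be expected to collect at different library sizes. The function name `expected_count`, the abundance values (1, 10 and 100 counts per million) and the 30M library size are illustrative assumptions of mine, not values from any particular dataset or tool.

```python
def expected_count(cpm, library_size):
    """Expected number of reads for a gene with abundance `cpm`
    (counts per million) in a library of `library_size` counted reads."""
    return cpm * library_size / 1_000_000


# Illustrative library sizes: 1M (as in the question) and a hypothetical 30M.
for library_size in (1_000_000, 30_000_000):
    # Illustrative abundances for a lowly, moderately and highly expressed gene.
    for cpm in (1, 10, 100):
        reads = expected_count(cpm, library_size)
        print(f"library {library_size:>10,} | {cpm:>3} CPM -> {reads:,.0f} expected reads")
```

With few reads per gene, count noise dominates for lowly expressed genes, which is why only the highly expressed ones remain testable at small library sizes.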

In my experience, for a genome the size of mouse, >30M or >50M raw reads per library are typically recommended for gene-level or isoform-level analyses, respectively. Of course, these recommendations assume that your libraries are of reasonable quality and that you won't lose a lot of reads during preprocessing and alignment. Losing 15-50% of reads may be reasonable (at least in my experience), so you may end up with final counts of e.g. 20M or 40M per sample for decent libraries. That's pretty far from 1M counts, so I would judge that to be quite a low number of reads in general. It will severely limit your comparisons to a reduced number of genes (very highly expressed, probably constitutive or tissue-specific), which may not be of interest. There's no harm in trying the comparisons, but be aware that 1M final counts from a 2M-read library is not the same as 1M final counts from a 50M-read library. In the second case, it is evident that something went wrong with the experiment, as you have lost 98% of your initial reads.
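The comparison between the two starting depths is simple arithmetic, sketched below. The helper name `fraction_lost` is hypothetical; the read numbers are the ones from the example above.

```python
def fraction_lost(raw_reads, final_counts):
    """Fraction of raw reads that did not end up in the final counts."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return 1 - final_counts / raw_reads


final_counts = 1_000_000  # final library size from the question

# Same final library size, two different starting depths.
for raw_reads in (2_000_000, 50_000_000):
    lost = fraction_lost(raw_reads, final_counts)
    print(f"{raw_reads:>10,} raw reads -> {lost:.0%} of reads lost")
```

Losing half the reads sits at the upper end of what I would call reasonable, whereas losing 98% points to a problem with the library or the processing rather than to shallow sequencing.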