Question

Gold standard for bulk RNA-seq downsampling - number of uniquely mapped reads


2.9 years ago

msimmer92 ▴ 300

I have a bulk RNA-seq dataset with a depth bias. Plotting the number of total mapped reads and of uniquely mapped reads per sample after mapping shows that many samples have far fewer uniquely mapped reads (2-3 million) than the rest, which average 15-20 million. (Additional info: mapping was done with STAR. Total coverage of the samples after sequencing, explored with FastQC, is not bad.)

That first group of samples is important for the condition, so simply dropping those samples from the dataset is not desired. Downsampling the others so that all samples have 3 million uniquely mapped reads was discussed. I was wondering whether this is OK or whether it's too low.

Q: What is the minimum number of uniquely mapped reads required in bulk RNA-seq? Thank you.

uniquely RNA-seq mapping bulk downsampling • 2.1k views


score 2 · Accepted Answer · 2021-05-19

There is no minimum number of uniquely mapped reads for RNAseq. In general:

Power to detect differential expression for a gene is a function of the number of reads mapping to it, which in turn is a function of the length of the gene, its level of expression and the depth at which sequencing is performed. If your samples have fewer reads, you will only be able to detect DE in highly expressed genes. Often people are interested in regulatory genes, like transcription factors, which tend to have low expression levels and are therefore the first to be lost at low depth.
You need fewer reads to do differential gene expression analysis than you do for other analyses, such as isoform discovery, differential transcript analysis/usage, splicing analysis etc.
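
To make the depth dependence concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a hypothetical gene that receives a fixed fraction of all uniquely mapped reads (the value `2e-6` is invented for illustration, not measured) and treats counts as Poisson, which ignores biological variability between replicates, so real data will be noisier than this.

```python
import math

def expected_count(depth, fraction):
    """Expected reads on a gene that captures `fraction` of all mapped reads."""
    return depth * fraction

def poisson_cv(mean_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_count)

# Hypothetical low-expressed gene: 2 reads per million uniquely mapped reads.
fraction = 2e-6

for depth in (3_000_000, 20_000_000):
    mean = expected_count(depth, fraction)
    print(f"{depth:>10,} reads: ~{mean:.0f} reads on the gene, "
          f"sampling CV ~{poisson_cv(mean):.2f}")
```

Under these assumptions the gene gets roughly 6 reads at 3 million versus roughly 40 at 20 million, and the relative sampling noise is correspondingly larger at the lower depth.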

ENCODE recommends at least 30 million uniquely mapped reads for human samples. In practice, I find that you can successfully do DE with fewer than that: 15-20 million gives a usable readout of the biological state of the sample. At 3 million, you will probably find that you have lower quality data.

What is the mapping rate for your samples with only 3 million uniquely mapped reads? If you put the same number of reads into the mapping but get an order of magnitude fewer successfully mapped, that suggests there is something else wrong with the data, other than just not many reads.
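
One way to check this is to compute the unique mapping rate per sample from STAR's `Log.final.out`. The sketch below is a minimal parser, assuming the usual `field name | value` layout of that file; the example numbers are invented, and the helper names `parse_star_log` and `unique_mapping_rate` are my own, not part of STAR.

```python
def parse_star_log(text):
    """Return {field name: value string} from the text of a STAR Log.final.out."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

def unique_mapping_rate(stats):
    """Fraction of input reads that mapped uniquely."""
    n_input = int(stats["Number of input reads"])
    n_unique = int(stats["Uniquely mapped reads number"])
    return n_unique / n_input

# Invented excerpt, for illustration only. For a real sample, read the file:
#   text = open("sample_Log.final.out").read()
example = """\
                          Number of input reads |\t25000000
                   Uniquely mapped reads number |\t2500000
        Number of reads mapped to multiple loci |\t15000000
"""

stats = parse_star_log(example)
print(f"unique mapping rate: {unique_mapping_rate(stats):.1%}")
```

If the low-count samples started with a similar number of input reads as the others but show a much lower unique rate, the remaining fields in the same file (multi-mapped and unmapped reads) show where the reads went.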

Finally, downsampling is not usually necessary for RNA-seq analysis, because DE tools are designed to normalize for differences in read coverage between samples. However, the default normalizations become less effective as the difference between samples increases, although you could explore the other normalisation methods built into the DE tools.
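
To illustrate what normalizing for depth means, here is a simplified sketch of median-of-ratios size factors, the idea behind DESeq2's default normalization. It is a toy re-implementation in plain Python, not the library's actual code, and the count matrix is invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of genes, each a list of raw counts (one entry per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # each sample's factor is the median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]

# Toy data: sample 2 is sample 1 sequenced 4x deeper.
counts = [
    [10, 40],
    [100, 400],
    [5, 20],
    [0, 3],   # skipped when estimating factors (contains a zero)
]

factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
print(factors)
print(normalized)
```

After dividing by the size factors, the first three genes have equal values in both samples, so the depth difference is removed without discarding reads. Note that the gene with a zero in the shallow sample stays uninformative: normalization rescales counts but cannot recover reads that were never sampled, which is why very large depth differences remain a problem.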