RNA-Seq minimum read length recommendations
4.5 years ago
GLR ▴ 20

Hi all,

I've been browsing the literature for RNA-Seq QC recommendations and have largely come to the conclusion that I should avoid read trimming and/or quality filtering, or else I risk introducing bias into any transcript expression estimates, especially as my data is pretty good. However, there doesn't seem to be such a clear consensus on a good minimum read length except to say that overly short reads can cause spurious alignment. How would you define an overly short read? My original read length is 100 bp, so would a minimum of 50 bp be a good length? I do have a few shorter reads from adaptor removal and PhiX filtering, and while I don't think there are many of them, I want to make sure I'm minimizing any introduced bias.

Thank you!

RNA-Seq • 7.6k views

"I should avoid read trimming and/or quality filtering"

I am not sure that is the consensus.


I've read in a few papers, such as Williams et al. (2016) and MacManes (2014), that doing anything but a gentle trim could introduce a level of bias into gene expression estimates, although this can be somewhat mitigated with read length filtering. My data is generally good, but I will be working with data from a range of sources, so I am still on the fence about whether or not to do a gentle quality trim of the data.


The Williams paper uses a Q40 cutoff and a minimum length of 1. Although technically possible, I would not consider those cutoffs reasonable.

The MacManes paper removed 25% of the dataset with trimming, which is fairly aggressive. However, adapter trimming was performed regardless of the PHRED cutoff.
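
To make those two cutoffs concrete, here is a minimal Python sketch of a 3' quality trim with a Phred threshold and a minimum-length filter. The helper names (`phred_scores`, `trim_3prime`), the toy read, and the Phred+33 encoding are assumptions for illustration; this is a plain trailing-base trim, not the algorithm used in either paper or by any particular trimming tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_quality, min_length):
    """Remove trailing bases below min_quality; return None if the read gets too short.

    Hypothetical helper for illustration only.
    """
    scores = phred_scores(qual)
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None  # read is discarded by the length filter
    return seq[:end], qual[:end]


# Toy read in Phred+33: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
seq = "ACGTACGTAC"
qual = "IIII555###"

gentle = trim_3prime(seq, qual, min_quality=5, min_length=5)
harsh = trim_3prime(seq, qual, min_quality=40, min_length=1)

print(gentle)  # ('ACGTACG', 'IIII555')
print(harsh)   # ('ACGT', 'IIII')
```

With the Q40 threshold and a minimum length of 1, the toy read is cut down to 4 bp and still kept, which is the kind of fragment a sensible length filter would normally discard.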

4.5 years ago

Read length choice in RNA-Seq depends on the end goal of your experiment. If you are only interested in differential gene expression, then (single-end) 50 bp should be enough. However, if you are interested in alternative splicing and/or gene fusion events, then long (paired-end) reads (> 100 bp) are crucial.


Thank you! For gene expression (GE) estimates, do you think the minimum read length could change with paired-end reads?

4.5 years ago

Hi. For a discussion about adapter trimming, you may find this thread useful: "Trimming adapter sequences - is it necessary?"

"there doesn't seem to be such a clear consensus on a good minimum read length except to say that overly short reads can cause spurious alignment"

Spurious alignments should have zero mapping quality, and these should be filtered after the alignment, for example when assigning reads to genes to produce the genes-by-samples count matrix. So I wouldn't remove short reads upfront unless they are really short, like 10 bp, and these should be a tiny minority anyway.
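
As a rough sketch of that post-alignment filter, the Python below drops unmapped and zero-MAPQ records from SAM-formatted text. It relies only on the SAM column layout (FLAG in column 2, MAPQ in column 5, header lines starting with `@`); the function names, the toy records, and the `min_mapq=1` threshold are assumptions for illustration, and MAPQ scales differ between aligners, so a suitable threshold depends on the aligner used.

```python
def passes_mapq(sam_line, min_mapq=1):
    """Return True for a mapped alignment record with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    is_unmapped = bool(flag & 0x4)
    return not is_unmapped and mapq >= min_mapq


def filter_sam(lines, min_mapq=1):
    """Yield header lines unchanged and alignment lines that pass the MAPQ filter."""
    for line in lines:
        if line.startswith("@") or passes_mapq(line, min_mapq):
            yield line


# Toy records (hypothetical): a confident hit, a MAPQ 0 short read, an unmapped read.
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50,
    "read2\t0\tchr1\t200\t0\t12M\t*\t0\t0\t" + "A" * 12 + "\t" + "I" * 12,
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t" + "A" * 10 + "\t" + "I" * 10,
]

kept = list(filter_sam(records, min_mapq=1))
print([line.split("\t")[0] for line in kept])  # ['@SQ', 'read1']
```

In practice you would more likely apply the same threshold with an existing tool, for example the `-q` option of `samtools view`, rather than parse SAM by hand.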

I don't have data at hand just now, but I think that even in mammalian genomes after a length of ~30 bp, using longer reads doesn't improve the mapping much. If you are interested, you could just cut your reads short and see how the mapping changes; these days aligners are so fast that it shouldn't take too long.
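
If you want to try that experiment, a minimal Python sketch for cutting reads short could look like the following. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function name `truncate_fastq_records` and the toy record are hypothetical.

```python
def truncate_fastq_records(lines, length):
    """Yield FASTQ lines with sequence and quality strings cut to `length` bases.

    Assumes strict four-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):
            # Sequence (line 2) and quality (line 4) must be cut to the same length.
            yield line[:length]
        else:
            yield line


# Toy record (hypothetical): a 10 bp read truncated to 4 bp.
fastq = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII"]
short = list(truncate_fastq_records(fastq, 4))
print(short)  # ['@r1', 'ACGT', '+', 'IIII']
```

You could then align the truncated file at a few lengths (say 30, 50, 75 bp) and compare the fraction of uniquely mapped reads against the full-length run.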


"even in mammalian genomes after a length of ~30 bp, using longer reads doesn't improve the mapping much"

There was a nice plot here: "How to find the shortest k-mer length that is unique in a large genome"

And another related discussion: "read length versus unique alignment rate"


Thank you so much for the links to both discussions; they're both really helpful.


Great, thank you so much for the link to the discussion about adaptor trimming.
