Removing low complexity reads from RNA-seq
6.4 years ago
Asaf 8.6k

I'm mapping bacterial RNA-seq reads to the genome and found something very weird. A very abundant RNA has an AATAATAAT repeat (three AAT units) somewhere in the middle, and I have a lot of reads that map to the gene (it's paired-end sequencing and the second mate maps) but contain more repeats than the reference, up to 8 AAT units. Since I have several million reads from this gene and it's a small RNA, I get thousands of such reads. I'm trying to figure out their source: whether it's biological or just an artifact of the RT, PCR, sequencing, etc.
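One way to quantify the expansion described above is to tabulate, per read, the longest run of back-to-back AAT units. The following is only a minimal sketch: the function names `longest_tandem_run` and `repeat_histogram` are hypothetical, and reads are assumed to be available as plain sequence strings.

```python
import re
from collections import Counter


def longest_tandem_run(seq, unit="AAT"):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    # Find every maximal stretch made of whole copies of the unit.
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def repeat_histogram(reads, unit="AAT"):
    """Count how many reads carry each longest-run length."""
    return Counter(longest_tandem_run(read, unit) for read in reads)


# Toy reads: one with the reference-like 3 units, one expanded to 8.
reads = ["GCAATAATAATGC", "GC" + "AAT" * 8 + "GC", "ACGTACGT"]
histogram = repeat_histogram(reads)
```

A histogram with a sharp peak at the reference length and a tail of longer runs would at least show how frequent the expanded reads are, although it cannot by itself distinguish a biological source from polymerase slippage.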

To work around this, I started screening for low-complexity reads and removing them with a DUST filter, which seems to work. My questions are:

  1. Is it common to remove low-complexity reads from the data?
  2. Why should they be removed? Is it because mapping will be difficult or wrong, or because the reads are probably the result of an error?

Thanks

RNA-Seq low-complexity qc • 4.1k views

Are you using a mapping quality filter before generating counts? Using a MAPQ threshold >10 will remove most low-complexity reads, because they map to many places in the genome. HTSeq, BEDTools and featureCounts can all do this.
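To illustrate what such a filter does, here is a minimal sketch that works directly on SAM text lines, where MAPQ is the fifth tab-separated column. The function name `filter_sam_by_mapq` is hypothetical, and the cutoff of 10 simply mirrors the threshold mentioned above; in practice the tools listed would be used instead.

```python
def filter_sam_by_mapq(lines, threshold=10):
    """Yield header lines and alignments whose MAPQ is above `threshold`."""
    for line in lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ; pass them through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        if mapq > threshold:
            yield line


# Toy SAM records (made up for illustration): r1 is confidently placed, r2 is not.
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr\t1\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr\t1\t3\t10M\t*\t0\t0\tAATAATAATA\tIIIIIIIIII",
]
kept = list(filter_sam_by_mapq(sam_lines))
```

Note that how MAPQ is assigned to multi-mapped reads depends on the aligner, so the right cutoff is an assumption to check against the aligner's documentation.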

My analysis is a bit different: I don't remove multi-mapped reads. In addition, I work on bacteria, so the genome is much smaller.

In that case, "Dusting" or "RepeatMasking" reads would seem appropriate.
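For readers unfamiliar with dusting, the idea can be sketched as a score over triplet counts: a read dominated by a few repeated 3-mers scores high. This is a simplified illustration, not the exact algorithm of any particular DUST implementation; the names `dust_score` and `is_low_complexity` and the cutoff of 2.0 are assumptions chosen for the example.

```python
from collections import Counter


def dust_score(seq, k=3):
    """Simplified DUST-like score: repeated k-mers push the score up."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers < 2:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    # Each k-mer seen c times contributes c*(c-1)/2 repeated pairs.
    pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return pairs / (n_kmers - 1)


def is_low_complexity(seq, cutoff=2.0):
    """Flag a read as low complexity; the cutoff here is an arbitrary example."""
    return dust_score(seq) > cutoff


repeat_read = "AAT" * 8                      # the 8-unit repeat from the question
mixed_read = "ACGTTGCAAGCTAGGATCCATGAC"      # made-up, more diverse sequence
flags = [is_low_complexity(repeat_read), is_low_complexity(mixed_read)]
```

Because the score is computed over the whole read, a read that contains the repeat plus enough flanking unique sequence may fall below the cutoff, so the threshold would need tuning against real data.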
