I'm mapping bacterial RNA-seq reads to the genome and found something very strange. A highly abundant RNA has an AATAATAAT repeat somewhere in the middle, and I have a lot of reads that map to the gene (it's paired-end sequencing and the second mate maps) but carry a larger number of repeats (up to 8 AAT units). Since I have several million reads for this gene and it's a small RNA, I get thousands of such reads. I'm trying to figure out their source: whether they're biological or just an artifact of the RT/PCR/sequencing etc.
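For illustration, here is a minimal Python sketch of how the repeat lengths can be tallied across a FASTQ file (the filename `reads_R1.fastq.gz` is just a placeholder for my actual file):

```python
import gzip
import re
from collections import Counter

def longest_aat_run(seq: str) -> int:
    """Length (in AAT units) of the longest uninterrupted AAT run in a read."""
    runs = re.findall(r"(?:AAT)+", seq)
    return max((len(r) // 3 for r in runs), default=0)

def fastq_sequences(path):
    """Minimal gzipped-FASTQ reader: yields the sequence line of each record."""
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # FASTQ records are 4 lines; the 2nd is the sequence
                yield line.strip()

# Histogram of the longest AAT run per read
hist = Counter(longest_aat_run(seq) for seq in fastq_sequences("reads_R1.fastq.gz"))
for n_units in sorted(hist):
    print(n_units, hist[n_units])
```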
To work around this I started screening the reads for low-complexity sequence and removing them (using a DUST filter; see the sketch after the questions below), which seems to work. My questions are:
- Is it common to remove low-complexity reads from the data?
- Why should they be removed? Is it because the mapping will be difficult or wrong, or because the reads are probably the result of an error?
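For context on what I mean by the DUST filter, here is a simplified sketch of a DUST-style triplet score (my own illustrative version, not the exact formula or cutoff any particular tool uses; note the score grows with the length of the repeated region, so the threshold would depend on read length):

```python
from collections import Counter

def dust_score(seq: str) -> float:
    """DUST-style low-complexity score based on overlapping triplet counts.

    Repeated triplets (e.g. in AATAATAAT...) inflate the score;
    a random sequence of the same length scores near zero.
    """
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) <= 1:
        return 0.0
    counts = Counter(triplets)
    raw = sum(c * (c - 1) / 2 for c in counts.values())
    return raw / (len(triplets) - 1)

def is_low_complexity(seq: str, threshold: float = 2.0) -> bool:
    """Flag a read as low-complexity; the 2.0 threshold is a guess."""
    return dust_score(seq) > threshold

print(dust_score("AAT" * 10))                # pure repeat: high score (~4.3)
print(dust_score("ACGTTGCAGTACCGATGATTGC"))  # mixed sequence: low score (~0.16)
```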