I mapped an RNA-Seq dataset in two ways: 1) using all reads, and 2) using collapsed reads.
I got a lower percentage of mapped reads in the second case and I can't understand why: collapsed reads are just the unique read sequences, so shouldn't they map to the same locations? I mean, the sequences are the same, only the quantity varies (many copies per read sequence in the first case and only one per read sequence in the second case). Is there a simple explanation for this? (I used fastx_collapser to collapse the reads.)
bowtie2 all reads mapping
    bowtie2 -p 6 -x genome_index -U seq_noAdapt.fastq.gz -S seq_noAdapt.sam 2>seq_noAdapt.log

    27557790 reads; of these:
      27557790 (100.00%) were unpaired; of these:
        3488459 (12.66%) aligned 0 times
        4281314 (15.54%) aligned exactly 1 time
        19788017 (71.81%) aligned >1 times
    87.34% overall alignment rate
bowtie2 collapsed reads mapping
    bowtie2 -p 6 -x genome_index -f seq_noAdapt_collapsed.fasta -S seq_noAdapt_collapsed.sam 2>seq_noAdapt_collapsed.log

    1166266 reads; of these:
      1166266 (100.00%) were unpaired; of these:
        504804 (43.28%) aligned 0 times
        192571 (16.51%) aligned exactly 1 time
        468891 (40.20%) aligned >1 times
    56.72% overall alignment rate
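To compare the two logs directly, the counts can be converted into an average number of copies per unique sequence, separately for aligned and unaligned sequences. This is only a rough sketch: it assumes every copy of a sequence aligns (or fails to align) exactly like its collapsed representative, which the logs alone cannot confirm. All numbers are copied from the two logs above; the variable names are made up for illustration.

```python
# Counts copied from the two bowtie2 logs above.
all_total, all_unaligned = 27_557_790, 3_488_459      # every read
uniq_total, uniq_unaligned = 1_166_266, 504_804       # collapsed (unique) sequences

all_aligned = all_total - all_unaligned
uniq_aligned = uniq_total - uniq_unaligned

# ASSUMPTION: each copy of a sequence behaves like its collapsed representative,
# so reads / unique sequences gives the mean copy number within each class.
copies_per_aligned = all_aligned / uniq_aligned
copies_per_unaligned = all_unaligned / uniq_unaligned
copies_overall = all_total / uniq_total

print(f"mean copies per aligned sequence:   {copies_per_aligned:.1f}")
print(f"mean copies per unaligned sequence: {copies_per_unaligned:.1f}")
print(f"mean copies per sequence overall:   {copies_overall:.1f}")
```

Under that assumption, aligned sequences occur about 36 times each on average and unaligned ones about 7 times each, so the per-read rate and the per-unique-sequence rate are not directly comparable: the first weights each sequence by its abundance and the second does not.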
Some notes about the dataset
- It's an old one, so it's single-end, 36 bp reads.
- It's intended to identify short RNAs.
- As you can see above, there is a high number of multi-mapping reads and this is expected: the loci giving rise to them are repetitive (and the genome itself is very repetitive).
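One way to put the collapsed run on the same scale as the all-reads run would be to weight each collapsed sequence by its copy number. The sketch below assumes the fastx_collapser header convention of `>rank-count` (for example `>1-4325`), so that the SAM read name ends in `-count`; the function name `weighted_alignment_rate` and the toy records are hypothetical and not taken from the actual files.

```python
def weighted_alignment_rate(sam_lines):
    """Fraction of original reads whose collapsed sequence aligned.

    ASSUMPTION: read names look like 'rank-count' (fastx_collapser style),
    where count is the number of identical reads that were collapsed.
    """
    aligned = total = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x900:
            continue  # skip secondary/supplementary records, if any
        copies = int(name.rsplit("-", 1)[1])
        total += copies
        if not flag & 0x4:  # 0x4 is the SAM "read unmapped" flag
            aligned += copies
    return aligned / total if total else 0.0


# Toy records (made up): one abundant aligned sequence, one rare unaligned one.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "1-90\t0\tchr1\t100\t42\t36M\t*\t0\t0\t*\t*",
    "2-10\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(weighted_alignment_rate(toy_sam))
```

On the real data it could be called as `weighted_alignment_rate(open("seq_noAdapt_collapsed.sam"))`. In the toy example half of the unique sequences align, but they account for 90 of the 100 original reads.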