3.2 years ago
Illinu


I have an RNA-Seq dataset from a library with an average fragment size of 250 bp that was sequenced in PE150 mode. The FastQC report shows a large number of overrepresented sequences marked "No Hit". What puzzles me is that the percentage of overrepresented sequences differs a lot between read 1 and read 2 (by a factor of 2 to 5), and it is always lower in read 2.

Is there an explanation for this observation? I would expect that if a fragment is overrepresented (for example an rRNA fragment), it is sequenced from both ends, so both read files should show a similar percentage.

Any hints would be highly appreciated.

Thanks, Illinu

Read 2 probably has slightly lower quality and a higher error rate. That higher error rate makes the k-mers diverge randomly, so the truly overrepresented k-mer will have a lower apparent abundance in read 2.
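To get a feel for the size of this effect, here is a minimal simulation sketch. It assumes errors are independent with one uniform per-base rate, and that sequences are compared as exact matches over a 50 bp window (FastQC truncates reads longer than 75 bp to 50 bp for this module). The function names and the example error rates are made up for illustration and are not taken from the dataset.

```python
import random


def exact_copy_fraction(seq_len, error_rate, n_reads=20000, seed=0):
    """Simulate reads from one template; return the fraction with zero errors."""
    rng = random.Random(seed)
    intact = 0
    for _ in range(n_reads):
        # A read only counts as the same sequence if every base is called correctly.
        if all(rng.random() >= error_rate for _ in range(seq_len)):
            intact += 1
    return intact / n_reads


def expected_exact_fraction(seq_len, error_rate):
    """Closed form for the same quantity under independent errors."""
    return (1.0 - error_rate) ** seq_len


if __name__ == "__main__":
    # Hypothetical per-base error rates, compared over a 50 bp window.
    for rate in (0.0005, 0.001, 0.005, 0.01):
        sim = exact_copy_fraction(50, rate)
        exp = expected_exact_fraction(50, rate)
        print(f"error rate {rate}: simulated {sim:.3f}, expected {exp:.3f}")
```

Under these assumptions, the drop in exact copies scales with the per-base error rate times the compared length, so comparing the simulated fractions for plausible read 1 and read 2 error rates shows how much of the gap sequencing errors alone could account for.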

Hi h.mon, thanks for replying. I thought about that too, but both reads have great quality scores across the whole read. Both drop slightly over the last 10 bp, but the lowest Phred score is around 34, and some of the error bars are a bit lower in read 2 than in read 1. I have a hard time believing that such slight differences would produce such large differences in overrepresentation: there are samples with 50% overrepresented sequences in read 1 and 25% in read 2.
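As a rough back-of-the-envelope check of that intuition, the Phred definition Q = -10 * log10(p) can be turned into an expected fraction of error-free sequences. This is only a sketch: it assumes every base has the same quality, that errors are independent, and that exact matching is done over 50 bp; the helper names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Per-base error probability for a Phred score q (Q = -10 * log10(p))."""
    return 10 ** (-q / 10)


def error_free_fraction(q, length):
    """Fraction of sequences of the given length expected to contain no errors."""
    return (1 - phred_to_error_prob(q)) ** length


def phred_for_fraction(fraction, length):
    """Uniform Phred score at which only `fraction` of sequences stay error-free."""
    p = 1 - fraction ** (1 / length)
    return -10 * math.log10(p)


if __name__ == "__main__":
    print(f"Q34, 50 bp: {error_free_fraction(34, 50):.3f} error-free")
    print(f"Q34, 150 bp: {error_free_fraction(34, 150):.3f} error-free")
    print(f"Q needed to halve counts over 50 bp: {phred_for_fraction(0.5, 50):.1f}")
```

With these simplifications, Q34 across 50 bp leaves about 98% of copies intact, and halving the count through errors alone would need a uniform quality of roughly Q19, far below what the report shows.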


