I've been trying to assemble Illumina paired-end reads (2x100 bp) from RNA-Seq, but after checking the FastQC results I noticed a distinct pattern in the first 12 bases:
It looks like this pattern is caused by hexamer priming that is not truly random, which is normal and expected: the first bases are biased towards sequences that prime more efficiently. See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/ and http://support.illumina.com/sequencing/faqs.html (search for: Why is GC high in the first few bases?)
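For anyone who wants to look at the per-position composition outside FastQC, here is a minimal sketch that counts bases at the first few read positions. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function names `base_composition` and `fractions` are made up for this illustration and are not part of FastQC or any other tool.

```python
import io
from collections import Counter


def base_composition(fastq_handle, n_positions=12):
    """Count bases at each of the first n_positions read positions.

    Assumes 4-line FASTQ records: header, sequence, '+', qualities.
    Returns a list of Counters, one per position.
    """
    counts = [Counter() for _ in range(n_positions)]
    for i, line in enumerate(fastq_handle):
        if i % 4 != 1:  # only the 2nd line of each record holds the sequence
            continue
        seq = line.strip().upper()
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1
    return counts


def fractions(counter):
    """Convert one position's counts into per-base fractions."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {base: counter[base] / total for base in "ACGTN"}


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("reads.fastq") for real data.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAGGA\n+\nIIII\n")
    for pos, counter in enumerate(base_composition(demo, n_positions=4), start=1):
        print(pos, fractions(counter))
```

With unbiased priming the fractions at each position should sit close to the overall base composition of the library; a bias like the one described here shows up as positions 1 to 12 deviating from the rest of the read.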
I've tried assembling these reads before and after trimming the first 12 bases, using Trinity with default settings:
|trimming first 12||171249||1270||133008734|
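For illustration, trimming a fixed number of leading bases can be sketched as below. This is a minimal, assumed implementation and not necessarily what was used for the numbers above; the function name `trim_fastq` is hypothetical, and it again assumes 4-line FASTQ records.

```python
import io


def trim_fastq(in_handle, out_handle, n_trim=12):
    """Drop the first n_trim bases and quality values from every record.

    Assumes 4-line FASTQ records. No reads are discarded, so running this
    on both mate files keeps the pairs in the same order.
    """
    while True:
        header = in_handle.readline()
        if not header:  # end of file
            break
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header)
        out_handle.write(seq[n_trim:] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[n_trim:] + "\n")


if __name__ == "__main__":
    # In-memory example: a 16 bp read trimmed down to its last 4 bases.
    src = io.StringIO("@r1\nACGTACGTACGTAAAA\n+\n" + "I" * 12 + "F" * 4 + "\n")
    dst = io.StringIO()
    trim_fastq(src, dst, n_trim=12)
    print(dst.getvalue(), end="")
```

Note that a read shorter than `n_trim` ends up with an empty sequence line here, so real data may need an extra length filter applied to both mates at once.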
Trimming seems better at first sight. I also tried mapping the untrimmed reads to the assembly with BBMap to check the match rate in the first bases, as suggested by Brian Bushnell here: http://seqanswers.com/forums/showthread.php?t=11843&page=2
Match rate by read position:
Only the first base from all_right.fastq seems to have a high error rate.
Now I'm trying to make sense of these results, so my question is: how is this kind of bias expected to affect de novo assembly?