Effects of random hexamer priming bias in RNA-Seq de novo assembly
2
4
Entering edit mode
6.8 years ago
Nestor Wendt ▴ 100

Hello,

I've been trying to assembly Illumina paired-end reads (2x100bp) from RNA-Seq, but after checking FastQC results I noticed a certain pattern in the first 12 bases: 

Looks like this pattern is caused by a not so random hexamer priming and thats normal and expected. Thus, the first bases are biased towards sequences that prime more efficiently.  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/ and http://support.illumina.com/sequencing/faqs.html (search for: Why is GC high in the first few bases?)

I've tried assembling this reads before and after trimming the first 12 bases using Trinity with default settings:

  contigs N50 assembled bases
without trimming 269491 1083 165837915
trimming first 12 171249 1270 133008734

 

Trimming seems better at first sight. I also tried mapping the untrimmed reads in the assembly to check the match rate in the first bases using BBmap, as suggested by Brian Bushnell here: http://seqanswers.com/forums/showthread.php?t=11843&page=2

Match rate by read position:

Base pos. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
all_left.fastq 0.93793 0.97391 0.97935 0.97992 0.98321 0.98266 0.98412 0.98521 0.98521 0.98643 0.98751 0.98761 0.98713 0.98808 0.98770
all_right.fastq 0.26800 0.93917 0.97231 0.97560 0.97726 0.97941 0.97856 0.98346 0.98445 0.98447 0.98320 0.98509 0.98278 0.98500 0.98770

 

Only the first base from all_right.fastq seems to have a high error rate.

Now I'm trying to make sense of these results, so my question is how does this kind of bias supposedly affects de novo assembly?

Thank you.

RNA-Seq Assembly Bias Illumina De novo assembly • 4.1k views
ADD COMMENT
1
Entering edit mode
6.8 years ago

That's interesting.  From the match rates you've displayed, I wonder if maybe your assembly would be just as good if you trimmed the first 2bp rather than the first 12bp.  If the lower continuity of the untrimmed data is coming from error bases, that should wipe most of it out.

Also, the read 2 base frequency histogram looks like the read 1 histogram just shifted by 1bp (read 1 has the T peak at position 7, read 2 has it at position 8, etc).  That, coupled with the fact that the first base of read 2 has a ~25% match rate - which you would expect from adapter sequence - makes me wonder if r2 sequencing is starting at the wrong location and the first base is actually adapter.

ADD COMMENT
1
Entering edit mode
6.8 years ago
Nestor Wendt ▴ 100

Hello,

I just finished a new assembly, trimming the first 2bp. Now I have this: 

  contigs N50 assembled bases
without trimming 269491 1083 165837915
trimming first 2 259761 1119 162273830
trimming first 12 171249 1270 133008734

 

Trimming the first 2bp had little effect. It's quite interesting. It really seems that the first 12 bases have 'equal' value lowering the continuity of the assembly, even with high match rates (~98%). I wonder if I trim, let's say, the first 20, N50 will continue to rise or it only happens when I trim the biased bases.

ADD COMMENT
0
Entering edit mode

Thanks for sharing that; it's clearly not what I expected.  I will suggest to my team that we look into the effects on our assemblies of trimming these bases.

ADD REPLY

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6