Question: Effects of random hexamer priming bias in RNA-Seq de novo assembly
Asked 4.1 years ago by Nestor Wendt (Brazil):

Hello,

I've been trying to assemble Illumina paired-end reads (2x100 bp) from RNA-Seq, but after checking the FastQC results I noticed a distinct pattern in the first 12 bases:

This pattern appears to be caused by not-so-random hexamer priming, which is normal and expected: the first bases are biased toward sequences that prime more efficiently. See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/ and http://support.illumina.com/sequencing/faqs.html (search for: Why is GC high in the first few bases?)
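For readers who want to see the bias without FastQC, here is a minimal sketch that tabulates base fractions at the first few read positions. The function name `base_composition` and the toy reads are illustrative assumptions, not part of the original post or of any tool mentioned in it.

```python
from collections import Counter

def base_composition(reads, n_positions=12):
    """Fraction of A/C/G/T at each of the first n_positions of the reads.

    `reads` is any iterable of sequence strings (hypothetical input; in
    practice these would be the sequence lines of a FASTQ file).
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())  # includes N calls, so fractions may sum to < 1
        fractions.append({b: c[b] / total for b in "ACGT"} if total else {})
    return fractions

# Toy reads, made up for illustration only.
toy_reads = ["GTACGA", "GTTCGA", "GAACGT", "CTACGA"]
for pos, freq in enumerate(base_composition(toy_reads, 3), start=1):
    print(pos, freq)
```

With unbiased priming, each position would be expected to approach the overall base composition of the transcriptome; strong position-specific peaks in the first bases are the signature described above.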

I've tried assembling these reads before and after trimming the first 12 bases, using Trinity with default settings:

                     contigs   N50    assembled bases
without trimming     269491    1083   165837915
trimming first 12    171249    1270   133008734
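For reference, N50 in the table is the usual contiguity statistic: the contig length L such that contigs of length at least L contain at least half of all assembled bases. A minimal sketch of that definition follows; the function name `n50` and the example lengths are illustrative, and this is not necessarily how the statistics script used for the table computes it.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Made-up contig lengths, for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Note that N50 can rise simply because short contigs disappear from the assembly, so it is worth reading it alongside the contig count and total assembled bases, as the table does.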

 

Trimming seems better at first sight. I also tried mapping the untrimmed reads to the assembly with BBMap to check the match rate in the first bases, as suggested by Brian Bushnell here: http://seqanswers.com/forums/showthread.php?t=11843&page=2

Match rate by read position:

Base pos.   all_left.fastq   all_right.fastq
1           0.93793          0.26800
2           0.97391          0.93917
3           0.97935          0.97231
4           0.97992          0.97560
5           0.98321          0.97726
6           0.98266          0.97941
7           0.98412          0.97856
8           0.98521          0.98346
9           0.98521          0.98445
10          0.98643          0.98447
11          0.98751          0.98320
12          0.98761          0.98509
13          0.98713          0.98278
14          0.98808          0.98500
15          0.98770          0.98770

 

Only the first base from all_right.fastq seems to have a high error rate.

Now I'm trying to make sense of these results. My question is: how is this kind of bias expected to affect de novo assembly?

Thank you.

Answer, 4.1 years ago, by Brian Bushnell (Walnut Creek, USA):

That's interesting.  From the match rates you've shown, I wonder whether your assembly would be just as good if you trimmed the first 2 bp rather than the first 12 bp.  If the lower continuity of the untrimmed data comes from erroneous bases, that should remove most of it.
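As an illustration of what trimming a fixed number of leading bases means, here is a minimal sketch that removes the first k bases (and their quality values) from every record. It assumes plain four-line, unwrapped FASTQ; the function name `trim_fastq` and the demo record are hypothetical, and in practice a dedicated trimming tool would normally be used instead.

```python
import io

def trim_fastq(in_handle, out_handle, k):
    """Drop the first k bases and qualities from each 4-line FASTQ record."""
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header)
        out_handle.write(seq[k:] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[k:] + "\n")

# Made-up record, for illustration only.
demo = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n")
out = io.StringIO()
trim_fastq(demo, out, 2)
print(out.getvalue())
```

Because every record is kept and only shortened, running the same function on the left and right files leaves the read pairs in the same order.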

Also, the read 2 base frequency histogram looks like the read 1 histogram shifted by 1 bp (read 1 has the T peak at position 7, read 2 has it at position 8, etc).  That, coupled with the fact that the first base of read 2 has a ~25% match rate, which is what you would expect from adapter sequence, makes me wonder whether read 2 sequencing is starting at the wrong location and the first base is actually adapter.
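The ~25% figure can be checked with a small calculation. If the read base is unrelated to the reference base at the aligned position (an assumption made here for illustration), the expected match rate is the sum over bases of P(read = b) x P(reference = b). When either side is uniform over A/C/G/T, that sum is 0.25 regardless of the other distribution. The function name `expected_match_rate` and the example frequencies are illustrative.

```python
def expected_match_rate(read_freqs, ref_freqs):
    """Expected match rate when read and reference bases are independent."""
    return sum(read_freqs.get(b, 0.0) * ref_freqs.get(b, 0.0) for b in "ACGT")

uniform = {b: 0.25 for b in "ACGT"}

# A fixed adapter base (always "A" here, chosen arbitrarily) against a
# reference with uniform base composition.
always_a = {"A": 1.0}
print(expected_match_rate(always_a, uniform))
```

The observed 0.26800 for the first base of all_right.fastq is close to this value, which is consistent with that base carrying essentially no information about the transcript.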

Answer, 4.1 years ago, by Nestor Wendt (Brazil):

Hello,

I just finished a new assembly after trimming the first 2 bp. Now I have this:

                     contigs   N50    assembled bases
without trimming     269491    1083   165837915
trimming first 2     259761    1119   162273830
trimming first 12    171249    1270   133008734

 

Trimming the first 2 bp had little effect, which is quite interesting. It really seems that each of the first 12 bases contributes about equally to lowering the continuity of the assembly, even with high match rates (~98%). I wonder whether, if I trim, say, the first 20 bases, the N50 will continue to rise, or whether it only rises when I trim the biased bases.


Thanks for sharing that; it's clearly not what I expected.  I will suggest to my team that we look into the effects on our assemblies of trimming these bases.

Reply written 4.1 years ago by Brian Bushnell