Effects of random hexamer priming bias in RNA-Seq de novo assembly
1
4
Entering edit mode
8.0 years ago
Nestor Wendt ▴ 100

Hello,

I've been trying to assembly Illumina paired-end reads (2x100bp) from RNA-Seq, but after checking FastQC results I noticed a certain pattern in the first 12 bases:

Looks like this pattern is caused by a not so random hexamer priming and that's normal and expected. Thus, the first bases are biased towards sequences that prime more efficiently. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/ and http://support.illumina.com/sequencing/faqs.html (search for: Why is GC high in the first few bases?)

I've tried assembling this reads before and after trimming the first 12 bases using Trinity with default settings:

                      contigs     N50      assembled bases
without trimming      269491      1083     165837915
trimming first 12     171249      1270     133008734


Trimming seems better at first sight. I also tried mapping the untrimmed reads in the assembly to check the match rate in the first bases using BBmap, as suggested by Brian Bushnell here

Base pos.           1           2           3           4           5           6           7           8           9          10           11          12          13          14          15
all_left.fastq      0.93793     0.97391     0.97935     0.97992     0.98321     0.98266     0.98412     0.98521     0.98521     0.98643     0.98751     0.98761     0.98713     0.98808     0.98770
all_right.fastq     0.26800     0.93917     0.97231     0.97560     0.97726     0.97941     0.97856     0.98346     0.98445     0.98447     0.98320     0.98509     0.98278     0.98500     0.98770


Only the first base from all_right.fastq seems to have a high error rate.

Now I'm trying to make sense of these results, so my question is how does this kind of bias supposedly affects de novo assembly?

Thank you

Illumina De-novo-assembly Assembly RNA-Seq Bias • 4.7k views
1
Entering edit mode
8.0 years ago

That's interesting. From the match rates you've displayed, I wonder if maybe your assembly would be just as good if you trimmed the first 2bp rather than the first 12bp. If the lower continuity of the untrimmed data is coming from error bases, that should wipe most of it out.

Also, the read 2 base frequency histogram looks like the read 1 histogram just shifted by 1bp (read 1 has the T peak at position 7, read 2 has it at position 8, etc). That, coupled with the fact that the first base of read 2 has a ~25% match rate - which you would expect from adapter sequence - makes me wonder if r2 sequencing is starting at the wrong location and the first base is actually adapter.

1
Entering edit mode

Hello,

I just finished a new assembly, trimming the first 2bp. Now I have this:

                      contigs     N50      assembled bases
without trimming      269491      1083     165837915
trimming first 2      259761      1119     162273830
trimming first 12     171249      1270     133008734


Trimming the first 2bp had little effect. It's quite interesting. It really seems that the first 12 bases have 'equal' value lowering the continuity of the assembly, even with high match rates (~98%). I wonder if I trim, let's say, the first 20, N50 will continue to rise or it only happens when I trim the biased bases.

0
Entering edit mode

Thanks for sharing that; it's clearly not what I expected. I will suggest to my team that we look into the effects on our assemblies of trimming these bases.