QUESTION: Can the combination of high levels of pre-mRNA and extreme normalization explain a poor mapping rate?
ISSUE: I am getting poor mapping efficiency (~64%) when I align my RNAseq reads back to a Trinity assembly built from a normalized subset of those same reads. There is little evidence of DNA contamination, but there is substantial evidence of what appears to be pre-mRNA: the assembled contigs contain many intron/exon and exon/intron junction fragments, and the handful of unaligned read pairs I inspected manually map to genes but not to the Trinity contigs derived from those genes' mRNA. FastQC also reports substantial sequence duplication. Trinity's in-silico normalization kept only ~10% of the reads for the assembly.
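For reference, this is roughly how I measured the mapping rate, following the read-representation check suggested in the Trinity documentation (file names and thread counts are placeholders for my actual data):

```shell
# Build a bowtie2 index of the Trinity assembly, then align the full
# (un-normalized) read set back to it. The alignment summary written to
# stderr reports the overall mapping rate.
bowtie2-build trinity_out_dir/Trinity.fasta trinity_idx

bowtie2 -p 8 -q --no-unal -k 20 \
    -x trinity_idx \
    -1 reads_1.fq.gz -2 reads_2.fq.gz \
    2> align_stats.txt \
    | samtools view -b -o aligned.bam -

# "overall alignment rate" is the ~64% figure quoted above
cat align_stats.txt
```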
BACKGROUND: I have 2x125 PE RNAseq read sets (~80 million pairs per set) from my study organism (genome size ~500 Mb). FastQC reports Phred >30 across almost the entire read length, along with substantial sequence duplication. Trinity's normalization reduced each library to ~10% of its reads (!), and retained slightly more once my two sets of data were combined (I normalized each set separately and again after combining the two sets to generate the assembly). The resulting Trinity assembly has ~146,430 transcripts (~67,000 'genes').
IDEA: Should I attempt to generate an assembly without normalizing the data, to see whether the mapping rate improves, trusting that most of the extra transcripts produced will be poorly supported and so be dropped in later filtering? This is bound to gobble up memory. Any suggestions would be appreciated.
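If it helps, the run I have in mind would look something like this (a sketch only: `--no_normalize_reads` is the Trinity option that disables in-silico normalization in recent versions; the file names, memory, and CPU values are illustrative, not my actual settings):

```shell
# Assemble both read sets together with normalization disabled.
# Skipping normalization means Trinity sees all ~160M pairs, hence
# the concern about memory in the question above.
Trinity --seqType fq \
    --left  setA_1.fq.gz,setB_1.fq.gz \
    --right setA_2.fq.gz,setB_2.fq.gz \
    --no_normalize_reads \
    --max_memory 200G --CPU 16 \
    --output trinity_nonorm
```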