Question: Very Few Reads Mapping Back To Contigs - Plant Transcriptome
I assembled non-normalised 454 plant transcriptome data using Trinity after the following steps:

1) pre-processing (removal of adaptors and vector contamination)
2) removal of rRNA sequences
3) removal of chloroplast and mitochondrial genes using BWA

From 3,70,929 reads, I got 21,486 contigs. When I mapped the reads back to the contigs using BWA, only 44,678 reads mapped, suggesting that only those were used in the assembly. What am I doing wrong here? I BLASTed a random selection of contigs and found that they share over 90% similarity with related legume proteins (although many were hypothetical). However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using BWA.

The Velvet assembly of the same data resulted in 15,323 contigs with lower N50, N90, maximum contig length, etc. The MIRA assembly produced more contigs and used more reads, but had lower N50, N90 and average contig length. Why are only 44,678 reads being used? Any advice is greatly appreciated.
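Since the comparison above rests on N50, N90, maximum and average contig length, here is a minimal sketch of how those numbers can be computed the same way for every assembler. The function names (`nx`, `summarize`) and the toy lengths are made up for illustration and do not come from any of the assemblers mentioned; the definition used is the common one (the length of the shortest contig in the smallest set of longest contigs that covers at least x% of the assembled bases).

```python
def nx(lengths, percent):
    """Return the Nx value for a list of contig lengths.

    Contigs are taken from longest to shortest until they cover at least
    `percent` percent of all assembled bases; the length of the last
    contig added is the Nx value (percent=50 gives N50).
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


def summarize(lengths):
    """Collect the metrics discussed in the thread for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "N50": nx(lengths, 50),
        "N90": nx(lengths, 90),
    }


# Hypothetical toy lengths, not taken from the assemblies in this thread.
toy_lengths = [100, 200, 300, 400, 500]
print(summarize(toy_lengths))
```

Running the same function over the contig lengths of each assembly keeps the metrics comparable, because different assemblers may report N50 with different minimum-length cutoffs.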

Do you mean 370k reads or 3 million? That would have a big impact on how to interpret your read usage. Also, I agree with (22308)3 that Newbler would be a good choice for your data.

In the opinion of Brian J. Haas, one of the key developers of Trinity:

"Ultimately, Trinity might not be the best tool for assembling 454 data, since coverage won't be anywhere near what is expected from Illumina in most cases, and Trinity exploits the high coverage data as part of reconstructing transcripts. The current version of Newbler is supposed to work especially well for 454 transcriptome data, so I encourage you to give that a try if you haven't already."

I would try Newbler 2.6 if you have access to it. Use BWA-SW (bwa bwasw) to map the 454 reads to the contigs, since it is designed for longer reads.
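To make the mapping suggestion concrete, here is a rough sketch, assuming a BWA version that still ships the bwasw subcommand. It builds the two command lines (index the contigs, then align with bwasw) and counts how many distinct reads have at least one alignment in the resulting SAM. The file names and function names are placeholders of mine, and the toy SAM records are invented. Counting by read name rather than by line is a deliberate choice, because a long-read aligner can report more than one alignment for the same read.

```python
def bwasw_commands(contigs_fasta, reads_fastq):
    """Build the command lines for indexing contigs and aligning with BWA-SW.

    bwasw writes SAM to standard output, so the second command would
    normally be run with its output redirected to a file.
    """
    return [
        ["bwa", "index", contigs_fasta],
        ["bwa", "bwasw", contigs_fasta, reads_fastq],
    ]


def mapped_read_stats(sam_lines, total_reads=None):
    """Count distinct read names with at least one mapped record.

    If `total_reads` is not given, the total is the number of distinct
    read names seen in the SAM, which assumes unmapped reads are also
    written to the file (FLAG bit 0x4 set).
    """
    seen, mapped = set(), set()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & 0x4:                # 0x4 = read unmapped
            mapped.add(name)
    total = total_reads if total_reads is not None else len(seen)
    fraction = len(mapped) / total if total else 0.0
    return len(mapped), total, fraction


# Invented toy records: r1 mapped, r2 unmapped, r3 mapped twice.
toy_sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tcontig1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tcontig1\t500\t30\t4M\t*\t0\t0\tACGT\tIIII",
]
print(mapped_read_stats(toy_sam))
```

For scale, if the total in the question is read as 370,929, then 44,678 mapped reads is roughly 12% of the input, which is the figure worth re-checking after remapping with bwasw.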

I did try Newbler. However, Newbler generated only 9,494 isotigs from 2,50,000 reads, although the N50 value, contig sizes and other metrics look quite good. I am going to BLASTX the entire set of contigs from each of the three assemblers against the proteomes of related species and the NR database, and let the results determine the best assembly. Any other strategy would be hugely appreciated.
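One possible way to turn the BLASTX comparison into numbers is sketched below, assuming the searches are saved in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, e-value in column 11). The function name, the 1e-5 e-value cutoff and the toy hit lines are my own assumptions, not something prescribed in this thread. For each assembly it reports the fraction of contigs with a significant hit and the number of distinct proteins hit, which helps separate an assembly with many redundant contigs from one that actually covers more genes.

```python
def blast_hit_summary(blast_tab_lines, n_contigs, max_evalue=1e-5):
    """Summarize tabular BLASTX output for one assembly.

    Returns (fraction of contigs with >= 1 hit at or below `max_evalue`,
             number of distinct subject proteins hit).
    `n_contigs` is the total number of contigs in the assembly, since
    contigs without any hit do not appear in the tabular output.
    """
    contigs_hit, proteins_hit = set(), set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):   # blank or comment
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:                           # incomplete record
            continue
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            contigs_hit.add(query)
            proteins_hit.add(subject)
    return len(contigs_hit) / n_contigs, len(proteins_hit)


# Invented toy hits: c1 has two good hits, c2 only a weak one.
toy_hits = [
    "c1\tp1\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
    "c1\tp2\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-20\t100",
    "c2\tp3\t40.0\t30\t18\t0\t1\t90\t1\t30\t0.5\t25",
]
print(blast_hit_summary(toy_hits, n_contigs=4))
```

Comparing these two numbers across the assemblies, alongside the read mapping rate, gives a less assembler-biased picture than N50 alone.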

