Hi there,
I ran a test of virus assembly using Trinity (2.02). The results are as follows:
1. normalize before assembly
Trinity --normalize_reads --seqType fa --max_memory 5G --single temp.Flavivirus.fa --CPU 6
prinseq-lite.pl -stats_assembly -fasta ./trinity_out_dir/Trinity.fasta
stats_assembly N50 1190
stats_assembly N75 356
stats_assembly N90 249
stats_assembly N95 231
2. de novo
Trinity --seqType fa --max_memory 5G --single temp.Flavivirus.fa --CPU 6
prinseq-lite.pl -stats_assembly -fasta ./trinity_out_dir/Trinity.fasta
stats_assembly N50 3800
stats_assembly N75 444
stats_assembly N90 248
stats_assembly N95 232
3. ref_guided and normalize
prinseq-lite.pl -stats_assembly -fasta ./trinity_out_dir/Trinity-GG.fasta
stats_assembly N50 835
stats_assembly N75 282
stats_assembly N90 230
stats_assembly N95 217
4. ref_guided assembly
Trinity --genome_guided_bam refguided.sam.sort.bam --max_memory 5G --CPU 6 --genome_guided_max_intron 10000
prinseq-lite.pl -stats_assembly -fasta ./trinity_out_dir/Trinity-GG.fasta
stats_assembly N50 1855
stats_assembly N75 316
stats_assembly N90 242
stats_assembly N95 223
Materials:
A clinical sample was subjected to PGM sequencing. An in-house pipeline showed that the reads covered 96.41% of the reference genome (gi|428621807|gb|JQ917404.1| Dengue virus 1 isolate RR57).
Discussion:
I expected the best N50 to come from the reference-guided assembly with normalization. In fact, it did not work as well as I had hoped. Could you help me figure out why the N50 from the de novo assembly was better than the N50 from the reference-guided assembly?
Thank you
N50 is a good statistic to measure the quantity of the assembly, but not the quality. Your de novo assembly, with normalization, is probably the most accurate assembly next to your reference-guided one.
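To illustrate why N50 says little about correctness, here is a minimal sketch of how an Nx statistic is commonly computed from contig lengths. The function name `nx` and the contig lengths are made up for illustration, and prinseq-lite may differ in the details of its own calculation.

```python
def nx(lengths, fraction):
    """Return the Nx value: the length L such that contigs of length >= L
    together cover at least `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    # Walk from the longest contig down until the running total
    # reaches the requested fraction of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy, made-up contig lengths (not taken from the Trinity runs above).
# Both sets have the same total length, but very different N50 values.
few_long = [3800, 500, 300, 250]
many_short = [1200, 1100, 1000, 900, 400, 250]
print(nx(few_long, 0.5), nx(many_short, 0.5))
```

In this toy case a single long contig is enough to raise the N50, whether or not that contig was assembled correctly, so the statistic needs to be paired with a check against the reference.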
Thank you. How should I evaluate the quality of an assembly? I extracted the longest contig from each of the four strategies above and ran BLAST on them. The BLAST results showed no significant differences between them (coverage: 100%, identity: 99%, E-value: down to 0).
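For anyone repeating this check, pulling the longest contig out of a Trinity FASTA file could be sketched roughly as below. The helper names `read_fasta` and `longest_contig` and the toy sequences are hypothetical, and only the Python standard library is assumed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file
    or any iterable of lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_contig(handle):
    """Return the (header, sequence) pair with the longest sequence."""
    return max(read_fasta(handle), key=lambda record: len(record[1]))


# Toy input with made-up contig names; for a real run, pass
# open("trinity_out_dir/Trinity.fasta") instead of this StringIO.
example = io.StringIO(">contig_a\nACGT\nACG\n>contig_b\nACGTACGTAC\n>contig_c\nAC\n")
name, seq = longest_contig(example)
print(name, len(seq))
```

The returned sequence can then be written to its own FASTA file and used as the BLAST query.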