Forum: How to know whether our Trinity de novo assembly is good or not?
4.3 years ago

I used Trinity 2.8.4 for de novo assembly of my plant RNA-seq data. The assembly has finished, but how do I know whether it is actually good, and which statistics should I consider?

N50: 2865
Total Trinity 'genes': 72392
Total Trinity transcripts: 146848
Median contig length: 1071
Average contig length: 1664.26
Total assembled bases: 244393103


Does this look good?

Assembly Forum • 3.1k views

"You can see that the Nx values based on the single longest isoform per gene are lower than the Nx stats based on all assembled contigs, as expected, and even though the Nx statistic is really not a reliable indicator of the quality of a transcriptome assembly, the Nx value based on using the longest isoform per gene is perhaps better for reasons described above". (emphasis is mine)

4.3 years ago
Tm ★ 1.1k

You should not rely on N50 when selecting an RNA-seq assembly, as it is not an appropriate metric for transcriptomes. You can also run the assembly with another assembler and compare the statistics. To get a better idea, though, you should also compare the assemblies at the annotation level.

4.3 years ago

Using the usual N50 metric for transcriptome assemblies can be highly misleading, because a transcriptome assembly does not aim for long contigs and a high N50, but for one contig per transcript. Furthermore, the most highly expressed transcripts are not necessarily the longest ones, and the majority of transcripts in a transcriptome assembly will normally have relatively low expression levels. Check out this discussion on Biostars:

Is it true that N50 is not an important parameter for quality in Transcriptome Assembly?
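For reference, Nx is the contig length at which the contigs, sorted from longest to shortest, first cover at least x% of the assembled bases. Here is a minimal Python sketch of that definition; the function name `nx` and the toy lengths are made up for illustration and are not taken from Trinity's own scripts:

```python
def nx(lengths, x=50):
    """Return the Nx value for a collection of contig lengths.

    Contigs are sorted longest first; the result is the length of the
    contig at which the running total reaches at least x% of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length
    raise ValueError("no contig lengths given")


# Toy example (made-up lengths): 1500 bases in total, half is 750.
toy_lengths = [100, 200, 300, 400, 500]
print(nx(toy_lengths))      # N50
print(nx(toy_lengths, 10))  # N10
```

With these toy lengths, the two longest contigs (500 + 400 = 900) already cover more than half of the 1500 bases, so the N50 is 400. Nothing in the calculation knows whether those long contigs are real transcripts, which is why the value says little about assembly quality on its own.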

N50 values can often be exaggerated when an assembly program generates too many transcript isoforms, especially for the longer transcripts. To mitigate this effect, Trinity also computes the Nx values using only the single longest isoform per 'gene':

  ## Stats based on ONLY LONGEST ISOFORM per 'GENE':

Contig N10: 3685
Contig N20: 1718
Contig N30: 909
Contig N40: 588
Contig N50: 439
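To see why the two sets of numbers can differ so much, here is a rough Python sketch that collapses transcripts to the longest isoform per 'gene' and compares the N50 before and after. It assumes Trinity-style identifiers such as `TRINITY_DN1_c0_g1_i2`, where dropping the trailing `_i<number>` gives the gene ID; the helper names and the toy lengths are invented for illustration and this is not Trinity's own implementation:

```python
import re


def nx(lengths, x=50):
    """Length at which longest-first contigs reach at least x% of all bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 100 >= total * x:
            return length
    raise ValueError("no contig lengths given")


def longest_isoform_per_gene(transcript_lengths):
    """Map each gene ID to the length of its longest isoform.

    Assumes IDs end in _i<number>; stripping that suffix gives the gene ID.
    """
    longest = {}
    for transcript_id, length in transcript_lengths.items():
        gene_id = re.sub(r"_i\d+$", "", transcript_id)
        longest[gene_id] = max(length, longest.get(gene_id, 0))
    return longest


# Made-up toy assembly: one long gene with three near-identical isoforms,
# plus seven shorter single-isoform genes.
toy = {
    "TRINITY_DN1_c0_g1_i1": 2000,
    "TRINITY_DN1_c0_g1_i2": 1900,
    "TRINITY_DN1_c0_g1_i3": 1800,
    "TRINITY_DN2_c0_g1_i1": 800,
    "TRINITY_DN3_c0_g1_i1": 700,
    "TRINITY_DN4_c0_g1_i1": 600,
    "TRINITY_DN5_c0_g1_i1": 500,
    "TRINITY_DN6_c0_g1_i1": 400,
    "TRINITY_DN7_c0_g1_i1": 300,
    "TRINITY_DN8_c0_g1_i1": 200,
}

n50_all = nx(toy.values())
n50_longest = nx(longest_isoform_per_gene(toy).values())
print("N50, all transcripts:", n50_all)
print("N50, longest isoform per gene:", n50_longest)
```

In this toy set the N50 drops from 1800 to 800 once the redundant isoforms of the long gene are collapsed, even though no gene was added or removed. For a real assembly you would read the IDs and sequence lengths from the Trinity FASTA file instead of a hard-coded dictionary.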


Go through this paper for methods to evaluate a transcriptome assembly.

My take is that even for genome assemblies, N50 should be taken with a pinch of salt, as it can give a misleading picture of assembly quality. If you want to learn more, check out this blog post:

Why is N50 used as an assembly metric (and what's the deal with NG50)?