Question: Genome assembly N50
9 months ago by
Inquisitive899550 wrote:

Hi, I am looking for a better explanation of N50 in genome assembly. As I understand it, N50 is the length of the contigs that cover 50% of the genome. Am I right? Also, for example, if two tools give an N50 of 500 and 1000 respectively, which of them would be the better tool? Thanks.


What is Wrong with N50? How can we make it better?

9 months ago by
lieven.sterck4.1k (VIB, Ghent, Belgium) wrote:

Your definition/understanding of N50 is somewhat correct indeed.

The way you calculate N50 is: you order your contigs from largest to smallest, then you take the cumulative sum of the contig lengths until you have >50% of your assembly; that contig is the L50, and its length is the N50.
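The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular assembler or QC tool: the function name `n50` and the example lengths are made up, and the threshold is assumed to be "at least 50%" (the commonly used convention), which differs from a strict ">50%" only when the cumulative sum lands exactly on half of the assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given a list of contig lengths.

    Assumption: the threshold is 'at least 50%' of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)  # largest contig first
    half_total = sum(lengths) / 2                   # half of the assembly size
    running = 0
    for length in lengths:
        running += length                           # cumulative sum of lengths
        if running >= half_total:                   # reached half of the assembly
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly: total = 290, half = 145.
# Cumulative sums: 80, 150 -> the second contig crosses the threshold.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```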

Intuitively one should go for the assembly with the highest N50 (1000 in this case), but N50 alone is not a good measure of performance; the total assembled size and other statistics matter as well (NG50, which is computed against the estimated genome size rather than the assembly size, might help a little here).


OP's definition of N50 is incorrect. N50 is the length of the shortest contig such that it, together with all contigs in the assembly of the same length or longer, covers at least 50% of the genome assembly.
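Written out as a formula, that definition might look as follows. The symbols are introduced here purely for illustration: the contig lengths are sorted so that \ell_1 is the longest, n is the number of contigs, and T is the total assembly size.

```latex
% Assumed notation: contig lengths sorted in decreasing order,
% \ell_1 \ge \ell_2 \ge \dots \ge \ell_n, with total T = \sum_{i=1}^{n} \ell_i.
\[
  k^{*} = \min\Bigl\{\, k \;:\; \sum_{i=1}^{k} \ell_i \,\ge\, \tfrac{T}{2} \Bigr\},
  \qquad
  \mathrm{N50} = \ell_{k^{*}},
  \qquad
  \mathrm{L50} = k^{*}.
\]
```

Here N50 is a length (in bases) and L50 is a count of contigs.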

Indeed, that was an assumption I made that might not have been totally clear, so I elaborated on it.

Thanks for your reply. I came across a definition which said N50 is the "weighted median statistic". What does this mean? Is N50 also described as the number of contigs above the median contig?


Well, in essence it is something like a weighted median statistic indeed: a median of the contig lengths in which each contig is weighted by its own length.

No, that would be L50: the number of contigs representing 50% of the assembly (whereas N50 is the actual length of the L50 contig).

The way you calculate N50 is: you order your contigs from largest to smallest, then you take the cumulative sum of the contig lengths until you have >50% of your assembly; that contig is the L50, and its length is the N50.

Keep in mind that this is in reference to the actual assembly, not the estimated genome size (that would then be NG50 & LG50).
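To make the difference concrete, here is a small sketch of NG50 and LG50, assuming the same cumulative-sum procedure but with the threshold set at half of an estimated genome size instead of half of the assembly. The function name `ng50_lg50`, the choice to return `(None, None)` when the assembly is too small, and the numbers are illustrative assumptions only.

```python
def ng50_lg50(contig_lengths, genome_size):
    """Return (NG50, LG50) relative to an estimated genome size.

    Returns (None, None) if the assembly covers less than half of the
    estimated genome, in which case neither value is defined.
    """
    lengths = sorted(contig_lengths, reverse=True)  # largest contig first
    target = genome_size / 2                        # half of the ESTIMATED genome
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, count                    # contig length, contig count
    return None, None


contigs = [80, 70, 50, 40, 30, 20]                  # hypothetical assembly, total 290
print(ng50_lg50(contigs, genome_size=400))          # (50, 3)
print(ng50_lg50(contigs, genome_size=1000))         # (None, None)
```

With an estimated genome of 400, the threshold is 200 rather than 145, so it takes three contigs instead of two to reach it and the reported length drops from 70 to 50. An incomplete assembly is penalised by NG50 in a way that N50 cannot capture.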

No contig is L50. Wikipedia puts it well:

L50 count is defined as the smallest number of contigs whose length sum produces N50
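Read that way, L50 is a count rather than a contig. A minimal sketch, assuming the "at least 50% of the assembly" threshold and using a made-up function name `l50`:

```python
from itertools import accumulate


def l50(contig_lengths):
    """Return L50: the smallest number of contigs (taken largest first)
    whose summed length reaches at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    # accumulate() yields the running total after each contig is added
    for count, running in enumerate(accumulate(lengths), start=1):
        if running >= half_total:
            return count
    raise ValueError("no contigs given")


# Hypothetical assembly: 80 + 70 = 150 >= 145, so two contigs suffice.
print(l50([80, 70, 50, 40, 30, 20]))  # 2
```

The N50 for the same example is the length of the last contig counted, i.e. 70.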