Hi. I just have a quick question in regards to calculating the N50 for a list of contigs. Anybody know how to do it in excel, by any chance? I'm not much of a programmer just yet, and I was just wondering if there was an easier way than just constructing a python program or something.

Get the lengths of your contigs into a column, and sort that column. Sum over all lengths to get the total number bases N, and then just traverse your list of lengths from top to bottom until the cumulative sum reaches N/2. (You could use an additional column to keep track of the running sum if that makes it easier.)

The length of the contig where your running sum passes N/2 is the N50. (In the case where the running sum exactly equals N/2, then the N50 is the average of this length and the subsequent one in the list.)

For example, below, the cumulative sum exceeds 44/2=22 on a contig with length 6, so N50 = 6.

contigs length cumulative sum
AAAAAAA 7 7
AAAAAA 6 13
AAAAAA 6 19
AAAAAA 6 25
AAAA 4 29
AAAA 4 33
AAA 3 36
AAA 3 39
AA 2 41
AA 2 43
A 1 44

Following Pierre's advice (actually Excel's paperclip), in R this is not much more complicated than in Excel. Say your contig sequences are in file contigs.txt, one per line, first line is the header:

dat<- read.table('contigs.txt', header= TRUE, stringsAsFactors= FALSE)
dat$len<- nchar(dat$contigs)
dat<- dat[order(dat$len, decreasing= TRUE),]
dat$cum_sum<- cumsum(dat$len)
# Where is N50?
dat[dat$cum_sum > max(dat$cum_sum)/2,][1,]
contigs len cum_sum
4 AAAAAA 6 25

This might be an occasion to get started with R, it will pay off...

While it's a good way to do it, and R is going to open up loads of doors, your answer will look like gibberish to someone who's never used it before without having any comments or reasoning behind the lines.

Hell, when I started in R after years of perl all the trailing comas made no sense to me whatsoever. Still not sure I full understand the reasoning...

Fair enough, it does look cryptic, sorry... But my answer was not supposed to be a starter to R. The point I tried to make, I guess, is that with few lines one can get things that might appear quite complicated at first impression.

I have coded pretty much exactly the instructions thatÂ leekaiinthesky has posted. If one follows his/her explanation along with the code above, I think is fairly understandable what each command does. Then of course, quirks like trailing commas and square brackets still look weird, but one gets the idea hopefully...

via t@tim_yates