Calculating N50 in Excel?
3
1
Entering edit mode
6.9 years ago

Hi. I just have a quick question in regards to calculating the N50 for a list of contigs. Anybody know how to do it in excel, by any chance? I'm not much of a programmer just yet, and I was just wondering if there was an easier way than just constructing a python program or something.

Thanks!

N50 assembly excel • 8.8k views
19
Entering edit mode

via t@tim_yates

6
Entering edit mode
6.9 years ago

Get the lengths of your contigs into a column, and sort that column. Sum over all lengths to get the total number bases N, and then just traverse your list of lengths from top to bottom until the cumulative sum reaches N/2. (You could use an additional column to keep track of the running sum if that makes it easier.)

The length of the contig where your running sum passes N/2 is the N50. (In the case where the running sum exactly equals N/2, then the N50 is the average of this length and the subsequent one in the list.)

For example, below, the cumulative sum exceeds 44/2=22 on a contig with length 6, so N50 = 6.

contigs    length    cumulative sum
AAAAAAA    7         7
AAAAAA     6         13
AAAAAA     6         19
AAAAAA     6         25
AAAA       4         29
AAAA       4         33
AAA        3         36
AAA        3         39
AA         2         41
AA         2         43
A          1         44

0
Entering edit mode

Thanks. I didn't know that when I sum over a column, I can go back and highlight part of the data to determine the sum of that group.

4
Entering edit mode
6.9 years ago

Following Pierre's advice (actually Excel's paperclip), in R this is not much more complicated than in Excel. Say your contig sequences are in file contigs.txt, one per line, first line is the header:

dat<- read.table('contigs.txt', header= TRUE, stringsAsFactors= FALSE)
dat$len<- nchar(dat$contigs)
dat<- dat[order(dat$len, decreasing= TRUE),] dat$cum_sum<- cumsum(dat$len) # Where is N50? dat[dat$cum_sum > max(dat\$cum_sum)/2,][1,]
contigs len cum_sum
4  AAAAAA   6      25


This might be an occasion to get started with R, it will pay off...

0
Entering edit mode

While it's a good way to do it, and R is going to open up loads of doors, your answer will look like gibberish to someone who's never used it before without having any comments or reasoning behind the lines.

Hell, when I started in R after years of perl all the trailing comas made no sense to me whatsoever. Still not sure I full understand the reasoning...

0
Entering edit mode

Fair enough, it does look cryptic, sorry... But my answer was not supposed to be a starter to R. The point I tried to make, I guess, is that with few lines one can get things that might appear quite complicated at first impression.

I have coded pretty much exactly the instructions that leekaiinthesky has posted. If one follows his/her explanation along with the code above, I think is fairly understandable what each command does. Then of course, quirks like trailing commas and square brackets still look weird, but one gets the idea hopefully...

1
Entering edit mode

For what it's worth, here was my solution using your definition of N50:

calculateN50 <- function(values, percent = 50) {
cumulative.sum <- cumsum(sort(values, decreasing = TRUE))
values[cumulative.sum > max(cumulative.sum) * percent / 100][1]
}

1
Entering edit mode
6.0 years ago
Medhat 9.0k

I think this function will do the job for you

N_vlaue <- function(x, nlength = 50, ...){
nlength <- nlength / 100
len.sorted <- rev(sort(x))
N50 <- len.sorted[cumsum(len.sorted) >= sum(len.sorted)*nlength][1]
return(N50)
}


usage

N_vlaue(contigs_length,50)


or

N_vlaue(contigs_length,90) or any value you want