Question: What Does Bwa'S '-N' Parameter Mean?
7
gravatar for Panos
6.6 years ago by
Panos1.6k
Geneva, Switzerland
Panos1.6k wrote:

Although this parameter appears to be BWA's 'main' cutoff (like e-value in BLAST -- correct me if I'm wrong), I can't understand its meaning... Executing 'bwa aln' it gives a short description that says:

-n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]

What I understand is that its default value is 0.04 but not a lot more than this!

Is the cutoff some kind of an e-value? Or is it based on the percentage of identities of the match?

bwa • 5.8k views
ADD COMMENTlink written 6.6 years ago by Panos1.6k

did you check the docs here: http://bio-bwa.sourceforge.net/bwa.shtml ?

ADD REPLYlink written 6.6 years ago by brentp22k

Yes, but I still don't understand the meaning of the '-n' cutoff! In BLAST, for example, an e-value of 1 means assigned to a hit means that you can get 1 hit with the same score just by chance. What does 0.04 in the -n cutoff mean? What exactly are the "missing alignments"?

ADD REPLYlink written 6.6 years ago by Panos1.6k

I agree, the documentation is lacking here.

ADD REPLYlink written 6.6 years ago by Michael Dondrup44k
6
gravatar for Michael Dondrup
6.6 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

I think I got it, it is a guess still, please correct me: to achieve lambda, the base error rate has to be multiplied by read length. Then when random errors are modeled as Poisson distributed, e.g. given reads of length 100, this will yield an average error rate lambda of 2 per read. (For the Poisson distribution, lambda denotes the rate of discrete events occurring in an interval or unit volume of space, they are not divided by space size because the space volume is present on 'both sides of the equation' it cancels out).

Given the poisson-distribution and readlength=100, what is the probability of observing 2 or more mismatches (n.mismatches) by chance/ for reasons of sequencing error? This can be generally calculated like this (in R):

p = ppois(n.mismatches,lambda=0.02*readlength, lower.tail=FALSE)

ppois(2,lambda=0.02*100, lower.tail=FALSE)
[1] 0.3233236

This is acceptable because the probability to see this or a more extreme outcome is > 0.04. Now, for other values:

ppois(4,lambda=0.02*100, lower=F)
[1] 0.05265302 > 0.04

> ppois(5,lambda=2, lower=F)
[1] 0.01656361 < 0.04

Thus, 0 to 4 mismatches are ok in one read of length 100, but 5 or more are not. While for a read of length 200, that would be only 7 mismatches, because ppois(8,0.02*200) = 0.02136343 < 0.04.

ADD COMMENTlink modified 6.5 years ago • written 6.6 years ago by Michael Dondrup44k
2

I just google-plus-asked Heng Li to look at this, and you appear to be on the right track with this answer...

ADD REPLYlink written 6.5 years ago by Madelaine Gogol5.0k

Thanks to my colleague Pawel btw for the hint to the Poisson distribution.

ADD REPLYlink written 6.6 years ago by Michael Dondrup44k

Thanks for accepting this, but as long as we look it up in the code or an author of BWA confirms, even I cannot be totally sure. Though, I believe this is the only reasonable method fitting the sentences used in the documentation.

ADD REPLYlink written 6.6 years ago by Michael Dondrup44k

That's nice, Madelaine!

ADD REPLYlink written 6.5 years ago by Michael Dondrup44k
4
gravatar for Istvan Albert
6.6 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

It is not an E-value - that concept does not apply to a short read aligner - it is a maximum edit distance that is allowed for an alignment to be reported:

... and maximum maxDiff differences are allowed in the whole sequence ...

The integer interpretation is straightforward (seen above)

I would agree that float interpretation is not entirely obvious - but it seems to be a way to specify a variable edit distance in case that the read lengths vary and a fixed distance is inappropriate. As a net effect it generates a different edit distance depending on the read length.

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Istvan Albert ♦♦ 77k
3

Actually, it is more complicated than I thought (the float interpretation in fact). Someone told me that they model the error as Poisson-distribution with uniform base error rate lambda=0.02. Why don't we ask the authors? They should definitely document this better.

ADD REPLYlink written 6.6 years ago by Michael Dondrup44k
1

sorry, I thought I understood it once, I tried to calculate, but the numeric values don't make sense. I thought the 0.04 is like a p-value cutoff (alpha value) to see this number of mismatches under the assumption that the base error is poisson distributed with lambda=0.02, meaning on average there are 2 in 100 mismatches.

ADD REPLYlink written 6.6 years ago by Michael Dondrup44k

So if I have 100 bp reads, a 0.04 cutoff could mean that I can have at most 4 mismatches. Right?

ADD REPLYlink written 6.6 years ago by Panos1.6k

An E-value damn well applies tgo a short read aligner, it's just that so far nobody bothered to calculate it.

ADD REPLYlink written 6.6 years ago by Marvin800

What I meant that the use of E-value as we use it with Blast is not useful during the typical short read mapping since most tools will not generate short partial alignments. All matches would have E-Values far above that of random chance and the actual errors are dominated by various sequencing artifacts - ranking them by the E-values would actually be a bad idea. I suspect that this is why it is not computed in general.

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 77k
1
gravatar for Marvin
6.6 years ago by
Marvin800
Marvin800 wrote:

Bwa operates under the assumption that every read should actually align. It will, however, not look for very bad alignments. That is a tradeoff between computational effort and the number of missed alignments (the false negative rate). The float parameter to -n is simply the false negative rate you're willing to accept.

When given a false negative rate, bwa then selects appropriate limits on substitutions. To do so, it needs to know how likely a substitution is, and it simply assumes 2%. That is what ends up being described as "the fraction of alignments lost assuming an error rate of 0.02".

ADD COMMENTlink written 6.6 years ago by Marvin800
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 460 users visited in the last hour