Question: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems
19
8.7 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

One of the most common gotchas I encounter introducing students to bioinformatics is the off-by-one coordinate shift problem(s) that arise when switching between one-based (e.g. BLAST) and zero-based (e.g. UCSC) genome coordinate systems.

I have yet to find a clear exposition of the differences between these two major coordinate systems (and their minor variants), and have tried to discuss the differences in a past blog post, but I don't feel confident I've covered all the bases on this issue.

The fact that this is not an obvious problem to everyone has come up in a recent BioStar post and its comments, and I was hoping that we could use this forum to discuss the relative merits of both systems.

genome coordinates • 15k views
ADD COMMENTlink modified 5.0 years ago by Biostar ♦♦ 20 • written 8.7 years ago by Casey Bergman18k
6

https://twitter.com/#!/dasmoth/status/42189749825449985

"If it doesn't have off-by-one errors, it isn't bioinformatics."

ADD REPLYlink modified 12 weeks ago by RamRS25k • written 8.7 years ago by Pierre Lindenbaum124k
35
8.7 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

0-based, half open systems allow cheap length calculations. That is, m-n instead of (m-n)+1 in a 1-based, closed system. Also, 0-based is convenient for programming; most widely-used programming languages use 0-based arrays. Another example is calculating overlap. To calculate the degree of overlap between two 0-based, half-open intervals, you can use the following:

a = [start1, end1)
b = [start2, end2)
overlap(a,b) = min(end1,end2) - max(start1,start2)

whereas with a one-based system it is:

a = [start1, end1]
b = [start2, end2]
overlap(a,b) = min(end1,end2) - max(start1,start2) + 1

The beauty of the above approach with 0-based is that if two intervals do not overlap, then the recipe will return a negative value whose absolute value is the distance between the two features.
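
As a minimal sketch of the two recipes above (the function names are made up for illustration, not taken from any library), the same pair of features can be checked in both systems:

```python
def overlap_half_open(start1, end1, start2, end2):
    """Overlap of two 0-based, half-open intervals [start, end)."""
    return min(end1, end2) - max(start1, start2)


def overlap_closed(start1, end1, start2, end2):
    """Overlap of two 1-based, closed intervals [start, end]."""
    return min(end1, end2) - max(start1, start2) + 1


# The same two features, written in each coordinate system.
print(overlap_half_open(0, 10, 5, 15))  # 5 bases shared
print(overlap_closed(1, 10, 6, 15))     # 5 bases shared

# Non-overlapping features: the result is negative, and its absolute
# value is the number of bases separating them (here bases 5..9).
print(overlap_half_open(0, 5, 10, 15))  # -5
```

The only difference between the two functions is the trailing `+ 1`, which is exactly the bookkeeping that tends to get forgotten.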

So, for programming, I much prefer 0-based, as it prevents tons of extra (ugly and more expensive) -1 and +1 operations in one's code.

The counter argument is that our brains are trained to think in 1-based, closed systems. I suspect the designers of various formats such as BED (0-based), BAM (0-based), VCF (1-based), and GFF (1-based) made conscious decisions regarding the coordinate system based on the intent of the format. For example, BED is a fundamental format in the UCSC browser and much of the underlying code depends on it. Thus, the coordinate system is 0-based for speed and code cleanliness. Similarly, BAM requires efficiency. In contrast, perhaps the designers of VCF and GFF were more concerned with "readability" of the format?

ADD COMMENTlink modified 12 weeks ago by RamRS25k • written 8.7 years ago by Aaronquinlan11k
4

And in a zero-based system, the start/end/length calculations still work for sequence features that pass across the origin of circular sequences.
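
A small sketch of one reading of this point, assuming half-open intervals and a hypothetical genome length of 100 (both function names are invented for illustration): a feature that crosses the origin can be written either with a negative start or with wrapped coordinates, and the length arithmetic stays the same.

```python
def feature_length(start, end):
    """Length of a 0-based, half-open interval [start, end)."""
    return end - start


def circular_length(start, end, genome_length):
    """Length of a half-open feature on a circular sequence.

    Assumes the feature is shorter than the whole circle; a feature
    covering the full circle would come out as 0 with this formula.
    """
    return (end - start) % genome_length


# A feature crossing the origin, written with a negative start.
print(feature_length(-5, 5))         # 10

# The same feature on a circle of length 100, written as 95 -> 5.
print(circular_length(95, 5, 100))   # 10

# A feature that does not cross the origin is unaffected.
print(circular_length(10, 20, 100))  # 10
```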

ADD REPLYlink written 8.7 years ago by iw9oel_ad6.0k
2

SAM is 1-based. BAM is 0-based.

ADD REPLYlink written 7.8 years ago by Aaronquinlan11k
1

I prefer to write [start1,end1) in the 0-based system. It is a different interpretation, but also a source of confusion. The 1-based coordinate system has no such ambiguity.

ADD REPLYlink written 8.7 years ago by lh331k
1

Great answer. I don't buy the "our brains are trained to think in 1-based, closed systems" argument, though. That may be true, but I don't think it's relevant. In my experience, it's rare to have features that start at the origin. That means that human beings hardly ever have to count from the origin in bioinformatics; it's always software that's doing it. So we should choose coordinate systems that make it easier for software.

ADD REPLYlink written 8.7 years ago by Mitch Skinner660
1

Just to add that SAM/BAM is one-based, not zero-based. See http://samtools.sourceforge.net/SAM1.pdf

I reached this page when googling to find out whether BAM was zero based and got the wrong answer.

ADD REPLYlink written 7.8 years ago by Fidel1.9k

The problem is that, with widely used formats such as BED and GFF, one can simply copy columns 4 and 5 of a GFF file to generate a BED file, and then the start coordinate will be off by one base pair. An easy mistake to make!
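
A minimal sketch of the conversion, assuming a tab-separated GFF line with the start in column 4 and the end in column 5 (the function name and the example line are made up for illustration): only the start needs to change.

```python
def gff_to_bed_fields(gff_line):
    """Return (chrom, start, end) in BED coordinates for one GFF line.

    GFF columns 4 and 5 are 1-based and closed; BED is 0-based and
    half-open, so the start is decremented and the end is kept as is.
    """
    fields = gff_line.rstrip("\n").split("\t")
    seqid = fields[0]
    gff_start = int(fields[3])
    gff_end = int(fields[4])
    return seqid, gff_start - 1, gff_end


# Hypothetical GFF record covering bases 11 through 20 (10 bases).
line = "chr1\tsource\tgene\t11\t20\t.\t+\t.\tID=gene1"
chrom, start, end = gff_to_bed_fields(line)
print(chrom, start, end)  # chr1 10 20
print(end - start)        # 10
```

Copying 11 and 20 straight into BED would instead describe a 9-base interval that is missing the first base of the feature.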

ADD REPLYlink written 8.7 years ago by Alper Yilmaz90

Also: Keith James is totally right. But it's not just for circular genomes: FlyBase has some features that are in negative coordinates (I don't know for sure why; I believe they're chromosome bands that have been mapped to locations before the sequenced region).

ADD REPLYlink written 8.7 years ago by Mitch Skinner660

Also: the "interbase" interpretation of zero-based, half-open intervals makes it easier to describe indels.
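
To illustrate with a sketch (the helper name `replace_interval` is hypothetical): if coordinates are read as the boundaries between bases, then an insertion is just a zero-length interval, and insertions, deletions and substitutions can all go through the same operation.

```python
def replace_interval(seq, start, end, new_bases):
    """Replace the 0-based, half-open interval [start, end) of seq.

    With start == end the interval is empty, so this is a pure
    insertion at the boundary between two bases.
    """
    return seq[:start] + new_bases + seq[end:]


seq = "ACGTACGT"

# Insertion of "TT" at boundary 3, i.e. the empty interval [3, 3).
print(replace_interval(seq, 3, 3, "TT"))  # ACGTTTACGT

# Deletion of the two bases in [3, 5).
print(replace_interval(seq, 3, 5, ""))    # ACGCGT
```

In a 1-based, closed system there is no natural way to write an empty interval, so an insertion point usually has to be described by convention (for example "after base 3").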

ADD REPLYlink written 8.7 years ago by Mitch Skinner660

In the BEDTools user manual (section 1.3.4), they mention that BED starts are zero-based and BED ends are one-based. How does this differ from the basic zero-based system?
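
One way to look at it, sketched with a made-up sequence and a made-up BED interval: "zero-based start, one-based end" and "zero-based, half-open" appear to be two descriptions of the same numbers, because the 1-based position of the last base equals the exclusive end in 0-based counting.

```python
seq = "ACGTACGTAC"

# Hypothetical BED interval.
bed_start, bed_end = 2, 6

# Read as 0-based, half-open [2, 6): this is exactly a Python slice.
print(seq[bed_start:bed_end])  # GTAC

# Read as "0-based start, 1-based end": convert the start to 1-based
# and keep the end, giving the closed range of bases 3 through 6.
first_base = bed_start + 1
last_base = bed_end
print(first_base, last_base)   # 3 6

# Picking out bases 3..6 by their 1-based positions gives the same bases.
print("".join(seq[i - 1] for i in range(first_base, last_base + 1)))  # GTAC
```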

ADD REPLYlink modified 12 weeks ago by RamRS25k • written 8.1 years ago by Rahul40

Not all widely used programming languages are 0-based. Two that are very commonly used in bioinformatics, that are 1-based, are XSLT and XQuery. (You could count these as only one, since they are both based on XPath, which is where the 1-based arrays are defined.) This list on Wikipedia has a few others.

ADD REPLYlink written 7.7 years ago by Chris Maloney330

Thanks for the ref Chris, edited accordingly.

ADD REPLYlink written 7.7 years ago by Aaronquinlan11k
8
8.1 years ago by
Friend80
Friend80 wrote:

Edsger Dijkstra has something to say about that: http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF

ADD COMMENTlink written 8.1 years ago by Friend80
2

That was a pleasure to read. Thank you.

ADD REPLYlink written 7.7 years ago by Aaronquinlan11k
3
7.7 years ago by
Pascal150
Germany
Pascal150 wrote:

Here is a very good explanation of the different coordinate conventions.

Also very good is an overview of conventions used by file formats and data bases (from the same blog).

ADD COMMENTlink modified 12 weeks ago by RamRS25k • written 7.7 years ago by Pascal150
1
8.7 years ago by
Maastricht
Egon Willighagen5.2k wrote:

I would say instead that what you really intended to ask is "what are the disadvantages of index-based coordinate systems". Because to me, zero or one is just a choice, and neither is more intuitive than the other. Moreover, the off-by-one problem is something you will have with any starting point, and it is not uncommonly caused by the last index, in addition to the first.

More intuitive are solutions like:

foreach (nucleotide : dnaSequence) {
  ...
}
ADD COMMENTlink modified 12 weeks ago by RamRS25k • written 8.7 years ago by Egon Willighagen5.2k

But how do I easily access the 5th to 10th bases?

ADD REPLYlink written 8.7 years ago by Rajarshi Guha880

That's cheating... now your question involved indices... :)

ADD REPLYlink written 8.7 years ago by Egon Willighagen5.2k