Question: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems
18
8.0 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

One of the most common gotchas I encounter introducing students to bioinformatics is the off-by-one coordinate shift problem(s) that arise when switching between one-based (e.g. BLAST) and zero-based (e.g. UCSC) genome coordinate systems.

I have yet to find a clear exposition of the differences between these two major coordinate systems (and their minor variants), and have tried to discuss the differences in a past blog post, but I don't feel confident I've covered all the bases on this issue.

The fact that this is not an obvious problem to all has come up in a recent BioStar post and its comments, and I was hoping that we could use this forum to discuss the relative merits of both systems.

genome coordinates • 14k views
modified 4.3 years ago by Biostar ♦♦ 20 • written 8.0 years ago by Casey Bergman18k
6

https://twitter.com/#!/dasmoth/status/42189749825449985 "If it doesn't have off-by-one errors, it isn't bioinformatics."

modified 4.9 years ago by Istvan Albert ♦♦ 79k • written 8.0 years ago by Pierre Lindenbaum118k
34
8.0 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

0-based, half-open systems allow cheap length calculations. That is, for an interval running from n to m, the length is m-n instead of (m-n)+1 as in a 1-based, closed system. Also, 0-based is convenient for programming; most widely used programming languages use 0-based arrays. Another example is calculating overlap. To calculate the degree of overlap between two 0-based, half-open intervals, you can use the following:

```
a = [start1, end1)
b = [start2, end2)
overlap(a,b) = min(end1,end2) - max(start1,start2)
```

whereas with a one-based system it is:

```
a = [start1, end1]
b = [start2, end2]
overlap(a,b) = min(end1,end2) - max(start1,start2) + 1
```

The beauty of the 0-based recipe is that if two intervals do not overlap, it returns a negative value whose absolute value is the distance between the two features, i.e. the number of bases separating them.
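As a minimal Python sketch of the two recipes (the function names `overlap_half_open` and `overlap_closed` are illustrative and not taken from any library), the same pair of features gives the same answer in either system, with the extra `+ 1` only needed in the closed case:

```python
def overlap_half_open(start1, end1, start2, end2):
    """Overlap of two 0-based, half-open intervals [start, end)."""
    return min(end1, end2) - max(start1, start2)


def overlap_closed(start1, end1, start2, end2):
    """Overlap of two 1-based, closed intervals [start, end]."""
    return min(end1, end2) - max(start1, start2) + 1


# The same two features, written in each coordinate system.
print(overlap_half_open(0, 10, 5, 15))  # 5
print(overlap_closed(1, 10, 6, 15))     # 5

# Non-overlapping 0-based features [0, 5) and [7, 10):
# positions 5 and 6 lie between them.
print(overlap_half_open(0, 5, 7, 10))   # -2
```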

So, for programming, I much prefer 0-based, as it prevents tons of extra (ugly and more expensive) "-1" and "+1" operations in one's code.

The counter argument is that our brains are trained to think in 1-based, closed systems. I suspect the designers of various formats such as BED (0-based), BAM (0-based), VCF (1-based), and GFF (1-based) made conscious decisions regarding the coordinate system based on the intent of the format. For example, BED is a fundamental format in the UCSC browser and much of the underlying code depends on it. Thus, the coordinate system is 0-based for speed and code cleanliness. Similarly, BAM requires efficiency. In contrast, perhaps the designers of VCF and GFF were more concerned with "readability" of the format?

4

And in a zero-based system, the start/end/length calculations still work for sequence features that pass across the origin of circular sequences.
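A rough Python sketch of what this could look like, assuming 0-based, half-open coordinates and a hypothetical helper `circular_length` (note that under this convention a feature with `start == end` comes out as length 0, so a feature spanning the whole circle would need separate handling):

```python
def circular_length(start, end, genome_length):
    """Length of a 0-based, half-open feature [start, end) on a circular
    sequence. end < start means the feature crosses the origin."""
    return (end - start) % genome_length


# Hypothetical 100 bp circular sequence.
print(circular_length(10, 30, 100))  # 20, an ordinary feature
print(circular_length(95, 5, 100))   # 10, covers positions 95..99 and 0..4
```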

2

SAM is 1-based. BAM is 0-based.

1

I prefer to write [start1,end1) in the 0-based system. The half-open convention is a different interpretation of the interval, but it is also a source of confusion. The 1-based, closed coordinate has no such ambiguity.

1

Great answer. I don't buy the "our brains are trained to think in 1-based, closed systems" argument, though. That may be true, but I don't think it's relevant. In my experience, it's rare to have features that start at the origin. That means that human beings hardly ever have to count from the origin in bioinformatics; it's always software that's doing it. So we should choose coordinate systems that make it easier for software.

1

Just to add that SAM/BAM is one-based, not zero-based. See http://samtools.sourceforge.net/SAM1.pdf

I reached this page when googling to find out whether BAM was zero based and got the wrong answer.

The problem with widely used formats such as BED and GFF is that one can simply copy columns 4 and 5 of a GFF file into a BED file, and the start coordinate will then be shifted by one base pair. An easy mistake to make!
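To make the pitfall concrete, here is a small Python sketch of the conversion, assuming tab-separated GFF input; the function name `gff_to_bed` and the example line are made up for illustration. Only the start needs shifting, because a 1-based inclusive end is numerically the same as a 0-based exclusive end:

```python
def gff_to_bed(gff_line):
    """Convert one GFF feature line (1-based, closed) into a minimal
    three-column BED line (0-based, half-open)."""
    fields = gff_line.rstrip("\n").split("\t")
    chrom = fields[0]
    start = int(fields[3])  # GFF column 4: 1-based start
    end = int(fields[4])    # GFF column 5: 1-based, inclusive end
    # Shift only the start; the end value stays the same.
    return f"{chrom}\t{start - 1}\t{end}"


# Hypothetical GFF record covering bases 101..200 (100 bases).
gff = "chr2L\tFlyBase\tgene\t101\t200\t.\t+\t.\tID=example"
print(gff_to_bed(gff))  # chr2L, 100, 200 (tab-separated)
```

Copying columns 4 and 5 verbatim would instead give a BED interval of 101 to 200, which is one base short at the start.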

Also: Keith James is totally right. But it's not just for circular genomes: FlyBase has some features that are in negative coordinates (I don't know for sure why; I believe they're chromosome bands that have been mapped to locations before the sequenced region).

Also: the "interbase" interpretation of zero-based, half-open intervals makes it easier to describe indels.
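One way to see this, as a Python sketch with a hypothetical helper `apply_edit`: under the interbase reading, an empty interval [p, p) names the point between two bases, so an insertion and a deletion can go through the same code path with no special case.

```python
def apply_edit(seq, start, end, replacement=""):
    """Replace the 0-based, half-open interval [start, end) of seq.

    start == end is an empty interval, i.e. a point between two bases,
    so a pure insertion needs no special handling.
    """
    return seq[:start] + replacement + seq[end:]


ref = "ACGTACGT"
print(apply_edit(ref, 4, 4, "TT"))  # ACGTTTACGT (insertion at interbase point 4)
print(apply_edit(ref, 2, 4))        # ACACGT (deletion of "GT")
```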

In the BEDTools user manual (http://bedtools.googlecode.com/files/BEDTools-User-Manual.v4.pdf, section 1.3.4), they mention that BED starts are zero-based and BED ends are one-based. How does this differ from the basic zero-based system?
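As far as I can tell the two descriptions pick out the same bases. A quick Python check, using made-up BED values of 4 and 10, enumerates the positions under each reading:

```python
bed_start, bed_end = 4, 10

# Reading 1: 0-based start, 1-based inclusive end.
# 0-based position 4 is the 5th base, so this is bases 5..10 in 1-based terms.
reading_1 = list(range(bed_start + 1, bed_end + 1))

# Reading 2: 0-based, half-open [4, 10) covers 0-based positions 4..9;
# add 1 to each to express them as 1-based positions.
reading_2 = [p + 1 for p in range(bed_start, bed_end)]

print(reading_1 == reading_2)  # True
```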

Not all widely used programming languages are 0-based. Two that are very commonly used in bioinformatics that are 1-based are XSLT and XQuery. (Although, you could say these only count as one, since they are both based on XPath, which is where the 1-based arrays are defined.) This list on Wikipedia, http://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28array%29#Array%5Fsystem%5Fcross-reference%5Flist, has a few others.

Thanks for the ref Chris, edited accordingly.

8
7.4 years ago by
Friend80
Friend80 wrote:

Edsger Dijkstra has something to say about that: http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF

2

That was a pleasure to read. Thank you.

3
7.0 years ago by
Pascal150
Germany
Pascal150 wrote:

Here is a very good explanation of the different coordinate conventions:

http://alternateallele.blogspot.de/2012/03/genome-coordinate-conventions.html

Also very good is an overview of the conventions used by file formats and databases (from the same blog):

http://alternateallele.blogspot.de/2012/03/genome-coordinate-cheat-sheet.html

1
8.0 years ago by
Maastricht
Egon Willighagen5.2k wrote:

I would say instead that what you really intended to ask is "what are the disadvantages of index-based coordinate systems?", because to me zero or one is just a choice, and neither is more intuitive than the other. Moreover, the off-by-one problem is something you will have with any starting point, and it is not uncommonly caused by the last index as well as the first.

More intuitive are solutions like:

```
foreach (nucleotide : dnaSequence) {
...
}
```

But how do I easily access the 5th to 10th bases?
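In a language with 0-based, half-open slicing, such as Python, one possible answer is a slice. The sequence below is made up for illustration:

```python
dna_sequence = "ACGTACGTACGT"

# The 5th to 10th bases (1-based, inclusive) correspond to the
# 0-based, half-open slice [4, 10).
sub = dna_sequence[4:10]
print(sub)       # ACGTAC
print(len(sub))  # 6, which is 10 - 4
```

This still requires knowing which convention the index uses, which is the point of the question.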