8.7 years ago by

United States

0-based, half-open systems allow cheap length calculations: the length is `m-n` instead of the `(m-n)+1` needed in a 1-based, closed system. Also, 0-based is convenient for programming; most widely used programming languages use 0-based arrays. Another example is calculating overlap. To calculate the degree of overlap between two 0-based, half-open intervals, you can use the following:

```
a = [start1, end1)
b = [start2, end2)
overlap(a,b) = min(end1,end2) - max(start1,start2)
```

whereas with a 1-based, closed system it is:

```
a = [start1, end1]
b = [start2, end2]
overlap(a,b) = min(end1,end2) - max(start1,start2) + 1
```

The beauty of the above approach with 0-based is that if two intervals *do not* overlap, the recipe returns a negative value whose absolute value is the distance between the two features (and 0 if they are directly adjacent).
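As a rough sketch, the two recipes above could be written in Python like this. The function names `overlap_half_open` and `overlap_closed` and the coordinates are made up for illustration; they are not taken from any particular library.

```python
def overlap_half_open(start1, end1, start2, end2):
    """Overlap of two 0-based, half-open intervals [start, end)."""
    return min(end1, end2) - max(start1, start2)


def overlap_closed(start1, end1, start2, end2):
    """Overlap of two 1-based, closed intervals [start, end]."""
    return min(end1, end2) - max(start1, start2) + 1


# The same pair of features written in both coordinate systems.
shared_half_open = overlap_half_open(10, 20, 15, 30)
shared_closed = overlap_closed(11, 20, 16, 30)

# Two features that do not overlap: the result is negative, and its
# absolute value is the number of bases between them.
gap = overlap_half_open(10, 20, 25, 30)

print(shared_half_open, shared_closed, gap)
```

Both calls on the overlapping pair describe the same five shared bases; only the closed version needs the extra `+ 1`.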

So, for programming, I much prefer 0-based, as it prevents tons of extra (ugly and more expensive) `-1` and `+1` operations in one's code.
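To illustrate where those adjustments creep in, here is a small sketch in Python, whose slices happen to be 0-based and half-open as well. The sequence and coordinates are invented for the example.

```python
seq = "ACGTACGTAC"

# 0-based, half-open: covers indices 2, 3, 4 and 5.
start, end = 2, 6
feature = seq[start:end]            # slices directly, no adjustment
assert len(feature) == end - start

# The same feature in 1-based, closed coordinates needs a -1 to slice
# and a +1 to compute its length.
start1, end1 = 3, 6
assert seq[start1 - 1:end1] == feature
assert end1 - start1 + 1 == len(feature)

# Adjacent half-open intervals tile the sequence with no adjustment.
assert seq[:start] + seq[start:end] + seq[end:] == seq
```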

The counterargument is that our brains are trained to *think* in 1-based, closed systems. I suspect the designers of various formats such as BED (0-based), BAM (0-based), VCF (1-based), and GFF (1-based) made conscious decisions regarding the coordinate system based on the intent of the format. For example, BED is a fundamental format in the UCSC browser and much of the underlying code depends on it. Thus, the coordinate system is 0-based for speed and code cleanliness. Similarly, BAM requires efficiency. In contrast, perhaps the designers of VCF and GFF were more concerned with the "readability" of the format?

https://twitter.com/#!/dasmoth/status/42189749825449985

25k• written 8.7 years ago by Pierre Lindenbaum ♦124k