
I frequently face variant files in either 0-based or 1-based coordinates (or, in the worst case, mixed!) and have to determine which I am looking at and how to convert between them. I usually go to the whiteboard and work it out. This time, I figured I would just create a digital copy.

First, a diagram to help illustrate:

The example above shows the (imaginary) first seven nucleotides of sequence on chromosome 1:

- 1-based coordinate system
  - Numbers nucleotides directly

- 0-based coordinate system
  - Numbers between nucleotides

To indicate a single nucleotide or variant:

- 1-based coordinate system
  - Single nucleotides, variant positions, or ranges are specified directly by their corresponding nucleotide numbers

- 0-based coordinate system
  - Single nucleotides, variant positions, or ranges are specified by the coordinates that flank them
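As a rough illustration of the two ways of addressing the same base, here is a minimal Python sketch. The sequence `SEQ` and the helper names `fetch_1based` and `fetch_0based` are made up for this example (the sequence is not necessarily the one in the diagram). Python slices happen to behave like 0-based "between nucleotides" coordinates.

```python
# Hypothetical 7-nucleotide sequence, used only for illustration.
SEQ = "TACGTCA"

def fetch_1based(seq, start, end):
    """Return bases start..end, both inclusive, with the first base numbered 1."""
    return seq[start - 1:end]

def fetch_0based(seq, start, end):
    """Return the bases lying between boundary `start` and boundary `end`."""
    return seq[start:end]

# The fifth nucleotide is 5-5 in 1-based and 4-5 in 0-based coordinates.
print(fetch_1based(SEQ, 5, 5))  # T
print(fetch_0based(SEQ, 4, 5))  # T
```

Both calls return the same base; only the way the position is written differs.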

To indicate a deletion or insertion:

- 1-based coordinate system
  - Deletions are specified directly by the positions of the deleted bases
  - Insertions are indicated by the coordinates of the bases that flank the insertion

- 0-based coordinate system
  - Deletions are specified by the coordinates that flank the deleted bases
  - Insertions are indicated directly by the coordinate position where the insertion occurs
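To make the deletion and insertion rules above concrete, the following Python sketch applies each kind of event to a toy sequence. The sequence and all function names are hypothetical; the 1-based insertion helper assumes the flanking-base convention described in this post.

```python
SEQ = "TACGTCA"  # hypothetical sequence, for illustration only

def delete_1based(seq, start, end):
    """Delete bases start..end (inclusive, first base is 1)."""
    return seq[:start - 1] + seq[end:]

def delete_0based(seq, start, end):
    """Delete the bases flanked by boundaries start and end."""
    return seq[:start] + seq[end:]

def insert_1based(seq, left, right, bases):
    """Insert between the flanking bases `left` and `right` (right == left + 1)."""
    if right != left + 1:
        raise ValueError("flanking bases must be adjacent")
    return seq[:left] + bases + seq[left:]

def insert_0based(seq, pos, bases):
    """Insert at boundary `pos` (start == end == pos)."""
    return seq[:pos] + bases + seq[pos:]

# Deleting the 5th base: 1-based 5-5, 0-based 4-5.
print(delete_1based(SEQ, 5, 5), delete_0based(SEQ, 4, 5))
# Inserting GG after the 4th base: 1-based 4-5, 0-based 4-4.
print(insert_1based(SEQ, 4, 5, "GG"), insert_0based(SEQ, 4, "GG"))
```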

Why does all this matter?

- Moving from UCSC browser/tools to Ensembl browser/tools or back
  - Ensembl uses 1-based coordinate system
  - UCSC uses 0-based coordinate system

- Some file formats are 1-based (GFF, SAM, VCF) and others are 0-based (BED, BAM)
  - See this excellent post and its many good links for more info:
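For the file formats mentioned above, here is a small sketch of what the difference means in practice: reading a BED-style line (0-based start, end not included) and reporting the same feature with GFF-style 1-based inclusive coordinates. The function name and the example line are hypothetical, and only the first four BED columns are handled.

```python
def bed_line_to_1based(line):
    """Parse 'chrom<TAB>start<TAB>end<TAB>name' (BED-style, 0-based)
    and return (chrom, start, end, name) with 1-based inclusive coordinates."""
    chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
    # Only the start shifts; the 0-based end boundary already equals
    # the number of the last included base.
    return chrom, int(start) + 1, int(end), name

print(bed_line_to_1based("chr1\t4\t5\tsnv1"))  # ('chr1', 5, 5, 'snv1')
```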

Pseudo-code to convert variants from 0-based coordinates to 1-based coordinates:

```
if (type == SNV) { start = start + 1; }  // end is unchanged
if (type == DEL) { start = start + 1; }  // end is unchanged
if (type == INS) { end = end + 1; }      // start is unchanged; result flanks the insertion
```

Pseudo-code to convert variants from 1-based coordinates to 0-based coordinates:

```
if (type == SNV) { start = start - 1; }  // end is unchanged
if (type == DEL) { start = start - 1; }  // end is unchanged
if (type == INS) { end = end - 1; }      // start is unchanged; result is the single insertion point
```
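The two pseudo-code blocks above can be sketched in Python as follows. The function names and the string labels `"SNV"`, `"DEL"`, `"INS"` are assumptions for the example; the arithmetic follows the convention used in this post (1-based insertions written as the two flanking bases).

```python
def zero_to_one(vtype, start, end):
    """Convert a 0-based variant to 1-based, following the post's convention."""
    if vtype in ("SNV", "DEL"):
        return start + 1, end
    if vtype == "INS":
        return start, end + 1
    raise ValueError(f"unknown variant type: {vtype}")

def one_to_zero(vtype, start, end):
    """Convert a 1-based variant (post's convention) back to 0-based."""
    if vtype in ("SNV", "DEL"):
        return start - 1, end
    if vtype == "INS":
        return start, end - 1
    raise ValueError(f"unknown variant type: {vtype}")

print(zero_to_one("SNV", 4, 5))  # (5, 5)
print(zero_to_one("DEL", 4, 7))  # (5, 7)
print(zero_to_one("INS", 4, 4))  # (4, 5)
```

Note that the conversion needs to know the variant type: insertions are handled differently from everything else.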

written 2.8 years ago by Obi Griffith


GBrowse and GFF are 1-based exclusively.

While BAM is 0-based, once you pull it out to something human-readable it can get turned into 1-based data (SAM). Great post, btw.

I'm not sure your description of an insertion in 1-based coordinates is canonical. I'm actually not sure there is a canonical description of insertions in 1-based coordinate range, which is why many low-level tools prefer to work in 0-based coordinates.

One of the primary advantages of 0-based coordinates is that the width of a feature is always 'end-start'. Insertions have 0-width, and hence they have the same start/end.

Your representation in 1-based coordinates makes the insertion appear as if it has a width of 2! I'm not sure there is a better representation, but a more programmatically friendly one is to describe it as chr1:5-4 (yes, start > end). It gives you the advantage that the width of the feature is always (end-start+1), and translation from 0-based to 1-based requires only incrementing the 'start' for ALL variant types.
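The width argument can be checked with a few lines of Python. This is only a sketch; the function names are made up, and the coordinates are the insertion example from this thread (0-based 4-4, written as either 4-5 or 5-4 in 1-based).

```python
def width_0based(start, end):
    """Width of a feature given 0-based coordinates."""
    return end - start

def width_1based(start, end):
    """Width of a feature given 1-based inclusive coordinates."""
    return end - start + 1

print(width_0based(4, 4))  # 0: an insertion has no width
print(width_1based(4, 5))  # 2: the flanking-base form looks two bases wide
print(width_1based(5, 4))  # 0: the start > end form keeps the width formula valid
```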

Because of these problems, I would argue that variants should not be represented in 1-based coordinates as a range, but by the start coordinate only.

The GFF3 spec indicates insertion sites should have start = end and the insertion occurs to the right of the coordinate. It's a little goofy but at least it's a "standard".
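A minimal sketch of that GFF3 convention, assuming the reading given in this comment (start = end, insertion to the right of that base). The function names are hypothetical. A 0-based insertion point `pos` sits just after the 1-based base numbered `pos`, so the number carries over unchanged.

```python
def insertion_0based_to_gff3(pos):
    """0-based insertion point (start == end == pos) -> GFF3-style (start, end)."""
    if pos < 1:
        # There is no base to the left of position 0 to anchor the site on.
        raise ValueError("insertion before the first base has no anchor base")
    return pos, pos

def gff3_to_insertion_0based(start, end):
    """GFF3-style insertion site (start == end) -> 0-based insertion point."""
    if start != end:
        raise ValueError("an insertion site needs start == end")
    return start

print(insertion_0based_to_gff3(4))     # (4, 4)
print(gff3_to_insertion_0based(4, 4))  # 4
```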

I agree with regard to zero-based coordinates: both Chado and JBrowse use zero-based coordinates internally.

Yes. Most of what you say is absolutely true. Except I'm not sure about your last point. How would you indicate a multi-bp deletion in 1-based with only a start coordinate? In any case, the point of this cheat sheet was not to propose a standard but merely to describe the real-world problem. As someone working at a genome center and dealing with a large variety of standard and non-standard variant files, it is quite common to see both 1-based and 0-based variant files with chr:start-stop format, whether that makes sense or not. The main thing is to realize that you need to think about this issue before blindly passing a variant file to any piece of software.

Completely agree that there is under-appreciated complexity in the different (and sometimes conflicting) representations used by different tools.

My last point is simply this: 1-based coordinates are inadequate to describe variants as "features" as a pair of start/stop 1-based indexes (because of the issue of insertions having 0-width). Yet of course, we sometimes are faced with the need to attempt to do so, and I actually prefer the (start+1,end) representation precisely as it gives the impression that you are representing something unnatural in 1-based coordinates when you see chr1:5-4.

It also is capable of being round-trip converted back to 0-based coordinates without knowledge of what type of variant the coordinates represent as it is variant type agnostic.
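The type-agnostic conversion described in this comment can be sketched as below. The function names are made up; the only rule is "shift the start, leave the end alone", so the same two functions serve SNVs, deletions, and insertions.

```python
def to_1based(start, end):
    """0-based -> 1-based using the (start+1, end) representation."""
    return start + 1, end

def to_0based(start, end):
    """1-based (start+1, end) representation -> 0-based."""
    return start - 1, end

# SNV, deletion, insertion: no variant type is passed in.
for zero_based in [(4, 5), (4, 7), (4, 4)]:
    one_based = to_1based(*zero_based)
    print(zero_based, "->", one_based, "->", to_0based(*one_based))
```

The insertion comes out as 5-4, the deliberately odd-looking start > end form.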

But for human-readable consumption, 1-based is preferred, and my preference is then to write variants with just the 1-based start, using something akin to (or equivalent to) HGVS g. notation.

For example:

When representing variants in a format intended for programs to parse though, 0-based intervals are always preferred.
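As a rough sketch of "something akin to HGVS g. notation", the function below turns a 0-based interval into a short human-readable string. The function and parameter names are hypothetical, the reference sequence prefix is omitted, and this is not a validated HGVS formatter. Note that the g. notation itself writes insertions with the two flanking positions.

```python
def hgvs_like(vtype, start, end, ref_base="", alt_bases=""):
    """Format a 0-based interval as an HGVS-like g. description (1-based)."""
    if vtype == "SNV":
        return f"g.{start + 1}{ref_base}>{alt_bases}"
    if vtype == "DEL":
        first, last = start + 1, end
        return f"g.{first}del" if first == last else f"g.{first}_{last}del"
    if vtype == "INS":
        # 0-based point `start` lies between 1-based bases start and start + 1.
        return f"g.{start}_{start + 1}ins{alt_bases}"
    raise ValueError(f"unknown variant type: {vtype}")

print(hgvs_like("SNV", 4, 5, "T", "C"))         # g.5T>C
print(hgvs_like("DEL", 4, 7))                   # g.5_7del
print(hgvs_like("INS", 4, 4, alt_bases="GG"))   # g.4_5insGG
```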

And some APIs are 1-based! E.g., the Java Picard library for BAM uses 1-based indexes :-)

For God's sake, check the TAIR10 Gbrowser http://tairvm17.tacc.utexas.edu/cgi-bin/gb2/gbrowse/arabidopsis/ ; it looks like a 0-based/1-based mixture to me (see the post "Help: TAIR10 GBrowser 1-based mixed with 0-based?"). Luckily we've got IGV. (P.S. Plant research cannot triumph over mammalian research; look at the UCSC Genome Browser.)

I came back to this, which I had bookmarked, but all the images have disappeared. Any chance of bringing them back?

Images are still there. Browser issues perhaps?

The images are blurred...

Something is wrong with the pseudo-code for converting INS between 0-based and 1-based. In a 0-based system, an INS coordinate is usually written in the format [X,X), where X is a coordinate. A 0-based [X,X) INS means the sequence is inserted before X, with X itself not included. A 1-based system has no such mechanism, so you have to convert to [X+1,X+1] to indicate that the sequence is inserted before X+1. Converting to [X,X+1], as you described above, makes no sense.

The same problem occurs when converting from 1-based to 0-based.

So to convert INS from a 0-based to a 1-based system:

`start = start + 1; end = end + 1`

and to convert INS from a 1-based to a 0-based system:

`start = start - 1; end = end - 1`

As a further point, INS is a very tricky case. Some programs do not recognize coordinates like [X,X) in 0-based. So there is another representation, [X,X+1), which also indicates that the sequence is inserted before coordinate X. In this situation, the conversion is as follows.

To convert INS from a 0-based to a 1-based system:

`start = start + 1; end = end`

To convert INS from a 1-based to a 0-based system:

`start = start - 1; end = end`

I vote for the second representation, which unifies the conversion method.

But I still could not arrive at the INS conversion you offered above.
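The two insertion representations discussed in this comment can be compared with a short Python sketch. The function names are made up; each function implements one of the 0-based to 1-based rules stated in the comment, and both end up at the same 1-based position.

```python
def ins_zero_width_to_1based(start, end):
    """0-based [X, X) insertion -> 1-based [X+1, X+1] (insert before base X+1)."""
    if start != end:
        raise ValueError("expected a zero-width [X, X) insertion")
    return start + 1, end + 1

def ins_one_wide_to_1based(start, end):
    """0-based [X, X+1) insertion -> 1-based [X+1, X+1] (insert before base X+1)."""
    if end != start + 1:
        raise ValueError("expected a one-wide [X, X+1) insertion")
    return start + 1, end

print(ins_zero_width_to_1based(4, 4))  # (5, 5)
print(ins_one_wide_to_1based(4, 5))    # (5, 5)
```

The second form uses the same "increment start only" rule as SNVs and deletions, which is the unification the comment argues for. The cost is that a one-wide insertion is indistinguishable from an SNV by coordinates alone.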
