Tutorial: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems
Posted 2.9 years ago by Obi Griffith (Washington University, St Louis, USA):

I am frequently faced with variant files in either 0-based or 1-based coordinates (or, in the worst case, mixed!) and have to determine which I am looking at and how to convert between them. I usually go to the whiteboard and work it out. This time, I figured I would just create a digital copy.

First, a diagram to help illustrate:

[Figure: 1-based vs 0-based numbering of the first seven nucleotides on chromosome 1]

The example above shows the (imaginary) first seven nucleotides of sequence on chromosome 1:

  • 1-based coordinate system
    • Numbers nucleotides directly
  • 0-based coordinate system
    • Numbers between nucleotides
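As a rough sketch of the two numbering schemes (Python; the sequence `seq` is invented for illustration and is not the one in the diagram):

```python
# Hypothetical 7-nucleotide sequence; not the one from the diagram.
seq = "ACGTTCA"

# 1-based: each nucleotide carries its own number (1..7).
one_based = {i + 1: base for i, base in enumerate(seq)}

# 0-based: numbers label the boundaries between nucleotides (0..7),
# so a sequence of n bases has n + 1 boundary coordinates.
boundaries = list(range(len(seq) + 1))

print(one_based[1], one_based[7])     # first and last nucleotide
print(boundaries[0], boundaries[-1])  # leftmost and rightmost boundary
```

Note that seven bases give seven 1-based positions but eight 0-based boundary coordinates.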

To indicate a single nucleotide or variant:

[Figure: indicating a single nucleotide or variant in each coordinate system]

  • 1-based coordinate system
    • Single nucleotides, variant positions, or ranges are specified directly by their corresponding nucleotide numbers
  • 0-based coordinate system
    • Single nucleotides, variant positions, or ranges are specified by the coordinates that flank them
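A minimal sketch of how the same bases are addressed in each system, assuming a made-up sequence and hypothetical helper names (`fetch_one_based`, `fetch_zero_based`):

```python
seq = "ACGTTCA"  # hypothetical sequence, for illustration only


def fetch_one_based(seq, start, end):
    """Return bases start..end, both inclusive, numbered from 1."""
    return seq[start - 1:end]


def fetch_zero_based(seq, start, end):
    """Return the bases lying between boundary coordinates start and end."""
    return seq[start:end]


# The fifth nucleotide: 1-based 5-5 is the same base as 0-based 4-5.
fifth_1 = fetch_one_based(seq, 5, 5)
fifth_0 = fetch_zero_based(seq, 4, 5)

# Nucleotides two through four: 1-based 2-4 equals 0-based 1-4.
range_1 = fetch_one_based(seq, 2, 4)
range_0 = fetch_zero_based(seq, 1, 4)
```

Python slices are themselves 0-based and half-open, which is why the 0-based helper needs no arithmetic.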

To indicate a deletion or insertion:

[Figure: indicating a deletion or insertion in each coordinate system]

  • 1-based coordinate system
    • Deletions are specified directly by the positions of the deleted bases
    • Insertions are indicated by the coordinates of the bases that flank the insertion
  • 0-based coordinate system
    • Deletions are specified by the coordinates that flank the deleted bases
    • Insertions are indicated directly by the coordinate position where the insertion occurs
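To make the deletion and insertion conventions above concrete, here is a small sketch that applies each kind of coordinate to a made-up reference. The function names are hypothetical, and the 1-based insertion follows this post's flanking-base convention, which is not the only one in use:

```python
seq = "ACGTTCA"  # hypothetical reference, for illustration only


def delete_one_based(seq, start, end):
    """Delete bases start..end (inclusive, numbered from 1)."""
    return seq[:start - 1] + seq[end:]


def delete_zero_based(seq, start, end):
    """Delete the bases lying between boundaries start and end."""
    return seq[:start] + seq[end:]


def insert_one_based(seq, start, end, bases):
    """Insert between the adjacent flanking bases start and end."""
    if end != start + 1:
        raise ValueError("flanking bases must be adjacent")
    return seq[:start] + bases + seq[start:]


def insert_zero_based(seq, pos, bases):
    """Insert at the single boundary coordinate pos (start == end)."""
    return seq[:pos] + bases + seq[pos:]


# Deleting the 3rd and 4th bases: 1-based 3-4, 0-based 2-4.
del_1 = delete_one_based(seq, 3, 4)
del_0 = delete_zero_based(seq, 2, 4)

# Inserting GG between the 4th and 5th bases: 1-based 4-5, 0-based 4-4.
ins_1 = insert_one_based(seq, 4, 5, "GG")
ins_0 = insert_zero_based(seq, 4, "GG")
```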

Why does all this matter?

Pseudo-code to convert variants from 0-based coordinates to 1-based coordinates:

if (type == SNV) { start = start + 1; end = end; }
if (type == DEL) { start = start + 1; end = end; }
if (type == INS) { start = start; end = end + 1; }

Pseudo-code to convert variants from 1-based coordinates to 0-based coordinates:

if (type == SNV) { start = start - 1; end = end; }
if (type == DEL) { start = start - 1; end = end; }
if (type == INS) { start = start; end = end - 1; }
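The pseudo-code above could be written in Python roughly as follows. This is a sketch under this post's conventions only (in particular, 1-based insertions given by their two flanking bases); the function names and the string labels "SNV", "DEL" and "INS" are assumptions for illustration:

```python
def zero_to_one(variant_type, start, end):
    """Convert 0-based coordinates to 1-based, per this post's conventions."""
    if variant_type in ("SNV", "DEL"):
        return start + 1, end
    if variant_type == "INS":
        return start, end + 1
    raise ValueError(f"unknown variant type: {variant_type}")


def one_to_zero(variant_type, start, end):
    """Convert 1-based coordinates to 0-based, per this post's conventions."""
    if variant_type in ("SNV", "DEL"):
        return start - 1, end
    if variant_type == "INS":
        return start, end - 1
    raise ValueError(f"unknown variant type: {variant_type}")


snv = zero_to_one("SNV", 4, 5)      # the 5th base
deletion = zero_to_one("DEL", 2, 4)  # the 3rd and 4th bases
insertion = zero_to_one("INS", 4, 4)  # between the 4th and 5th bases
```

The two functions are inverses of each other, so a round trip returns the original coordinates as long as the variant type is known.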

GBrowse and GFF are 1-based exclusively.

Reply written 2.9 years ago by scott

While BAM is 0-based, once you pull it out to something human-readable it can get turned into 1-based data (SAM). Great post, btw.

Reply written 2.9 years ago (modified 2.9 years ago) by Alex Reynolds

I'm not sure your description of an insertion in 1-based coordinates is canonical. I'm actually not sure there is a canonical description of insertions in 1-based coordinate range, which is why many low-level tools prefer to work in 0-based coordinates.

One of the primary advantages of 0-based coordinates is that the width of a feature is always 'end-start'. Insertions have 0-width, and hence they have the same start/end.

Your representation in 1-based coordinates makes the insertion appear as if it has a width of 2! I'm not sure there is a better representation, but a more programmatically friendly one is to describe it as chr1:5-4 (yes, start > end). It gives you the advantage that the width of the feature is always (end-start+1), and translation from 0-based to 1-based only requires incrementing the 'start' for ALL variant types.

Because of these problems, I would argue for variants, you should not represent them in 1-based coordinates as a range, but just as the start coordinate only.
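The width arithmetic and the type-agnostic conversion described in this reply can be sketched like this (Python; all function names are hypothetical):

```python
def width_zero_based(start, end):
    """Width of a feature given 0-based (between-base) coordinates."""
    return end - start


def width_one_based(start, end):
    """Width of a feature given 1-based inclusive coordinates."""
    return end - start + 1


def zero_to_one_agnostic(start, end):
    """Convert any variant type by incrementing only the start."""
    return start + 1, end


def one_to_zero_agnostic(start, end):
    """Inverse of zero_to_one_agnostic; no variant type needed."""
    return start - 1, end


# A 0-based insertion at boundary 4 becomes 5-4 in 1-based (start > end).
ins_one_based = zero_to_one_agnostic(4, 4)
# A 0-based two-base deletion 2-4 becomes 3-4 in 1-based.
del_one_based = zero_to_one_agnostic(2, 4)
```

With this convention the insertion keeps a width of zero in both systems, at the cost of a start that is greater than its end.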

Reply written 2.9 years ago by Gabe Rudy

The gff3 spec indicates insertion sites should have start = end and the insertion occurs to the right of the coordinate. It's a little goofy but at least it's a "standard".

I agree with regard to zero based coordinates: both chado and jbrowse use zero based internally.

Reply written 2.9 years ago by scott

Yes. Most of what you say is absolutely true. Except I'm not sure about your last point. How would you indicate a multi-bp deletion in 1-based with only a start coordinate? In any case, the point of this cheat sheet was not to propose a standard but merely to describe the real world problem. As someone working at a genome center and dealing with a large variety of standard and non-standard variant files it is quite common to see both 1-based and 0-based variant files with chr:start-stop format whether that makes sense or not. The main thing is to realize that you need to think about this issue before blindly passing a variant file to any piece of software.

Reply written 2.9 years ago (modified 2.9 years ago) by Obi Griffith

Completely agree that there is under-appreciated complexity in the different (and sometimes conflicting) representations used by different tools.

My last point is simply this: a pair of 1-based start/stop indexes is inadequate for describing variants as "features" (because insertions have zero width). Yet of course, we are sometimes faced with the need to attempt to do so, and I actually prefer the (start+1,end) representation precisely because, when you see chr1:5-4, it gives the impression that you are representing something unnatural in 1-based coordinates.

It can also be round-trip converted back to 0-based coordinates without knowing what type of variant the coordinates represent, since the conversion is variant-type agnostic.

But for human-readable consumption, 1-based is preferred, and my preference is then to write variants with just the 1-based start, using something akin (or equivalent) to HGVS g. notation.

For example:

  • chr1:5insT (insertion of a T between positions 4 and 5)
  • chr1:5C>T (change of reference C to T)
  • chr1:5delTTC (3bp deletion of TTC)
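A tiny sketch of producing strings in the style of the three examples above. The function `describe_variant` and its parameters are hypothetical, and the output mirrors the list above rather than strict HGVS syntax:

```python
def describe_variant(chrom, pos, kind, ref="", alt=""):
    """Build a compact, start-only variant string from a 1-based position."""
    if kind == "SNV":
        return f"{chrom}:{pos}{ref}>{alt}"
    if kind == "INS":
        return f"{chrom}:{pos}ins{alt}"
    if kind == "DEL":
        return f"{chrom}:{pos}del{ref}"
    raise ValueError(f"unknown variant type: {kind}")


ins_text = describe_variant("chr1", 5, "INS", alt="T")
snv_text = describe_variant("chr1", 5, "SNV", ref="C", alt="T")
del_text = describe_variant("chr1", 5, "DEL", ref="TTC")
```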

When representing variants in a format intended for programs to parse though, 0-based intervals are always preferred.

Reply written 2.9 years ago (modified 2.9 years ago) by Gabe Rudy

And some APIs are 1-based! E.g., the Java Picard library for BAM uses 1-based indexes :-)

Reply written 2.9 years ago by Pierre Lindenbaum

For God's sake, check the TAIR10 GBrowser at http://tairvm17.tacc.utexas.edu/cgi-bin/gb2/gbrowse/arabidopsis/ ; it looks like a mixture of 0-based and 1-based to me (see the post "Help: TAIR10 GBrowser 1-based mixed with 0-based?"). Luckily we have IGV. (P.S. Plant research cannot triumph over mammalian research; just look at the UCSC Genome Browser.)

Reply written 2.9 years ago by Puriney

I came back to this which I had bookmarked but all the images have disappeared. Any chance of bringing them back?

Reply written 2.6 years ago by Daniel

Images are still there. Browser issues perhaps?

Reply written 2.6 years ago by Obi Griffith

The images are blurred...

Reply written 16 months ago by tangming2005

Something is wrong with the pseudo-code for converting INS between 0-based and 1-based. In a 0-based system, an INS coordinate is usually written as [X,X), where X is a coordinate. In 0-based, an [X,X) INS means the sequence is inserted before X, with X itself not included. The 1-based system has no such mechanism, so you have to convert to [X+1,X+1] to indicate that the sequence is inserted before X+1. Converting to [X,X+1], as you describe above, makes no sense.

The same applies when converting from 1-based to 0-based.

So to convert INS from 0 to 1-based system:

start = start + 1; end = end + 1

and to convert INS from 1 to 0-based system:

start = start - 1; end = end - 1

As a further point, INS is a very tricky case. Some programs do not recognize a coordinate like [X,X) in 0-based. So there is another representation, [X,X+1), which also indicates that the sequence is inserted before coordinate X. In this situation, the conversion is as follows:

to convert INS from 0 to 1-based system:

start = start + 1; end = end

to convert INS from 1 to 0-based system:

start = start - 1; end = end

Personally, I vote for the second representation, which unifies the conversion method.

But I still cannot come up with a case that matches the INS conversion you gave above.
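The two INS conventions in this reply can be compared side by side with a short sketch (Python; function names are hypothetical, and X = 4 is an arbitrary example coordinate):

```python
def ins_zero_to_one_empty(start, end):
    """0-based [X, X) -> 1-based [X+1, X+1]: shift both coordinates."""
    return start + 1, end + 1


def ins_one_to_zero_empty(start, end):
    """1-based [X+1, X+1] -> 0-based [X, X)."""
    return start - 1, end - 1


def ins_zero_to_one_padded(start, end):
    """0-based [X, X+1) -> 1-based [X+1, X+1]: shift only the start."""
    return start + 1, end


def ins_one_to_zero_padded(start, end):
    """1-based [X+1, X+1] -> 0-based [X, X+1)."""
    return start - 1, end


x = 4
from_empty = ins_zero_to_one_empty(x, x)        # [4, 4) in 0-based
from_padded = ins_zero_to_one_padded(x, x + 1)  # [4, 5) in 0-based
```

Both conventions land on the same 1-based coordinates; only the second uses the same start-only shift as SNVs and deletions.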

Reply written 14 months ago (modified 14 months ago) by Chen Sun
Answer written 10 months ago by Shicheng Guo (United States/San Diego/UCSD):

It doesn't make sense to have two such coordinate systems. One unified system would be easier to use and would keep us from making mistakes.


That ship sailed long ago (heck, even Fortran and C differ in whether they use 0-based or 1-based indexing by default).

Reply written 10 months ago by Devon Ryan

Yes, you are right. Perl and Python use 0 as the first index of an array. Anyway, we programmers need to remember all the traps. Maybe it is a natural barrier that keeps non-programmer enemies from entering our field easily. LoL. Anyway, I like it.

Reply written 10 months ago by Shicheng Guo