Tutorial: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems
Written 3.4 years ago by Obi Griffith (Washington University, St Louis, USA):

I am frequently faced with variant files in either 0-based or 1-based coordinates (or, in the worst case, mixed!) and have to determine which I am looking at and how to convert between them. I usually go to the whiteboard and work it out. This time, I figured I would just create a digital copy.

In addition to the illustrations/explanation below, Ben Ainscough (a member of our lab) has created a Python tool to convert between zero-based and one-based coordinate systems here: https://github.com/griffithlab/convert_zero_one_based

First, a diagram to help illustrate:

The example above shows the (imaginary) first seven nucleotides of sequence on chromosome 1:

  • 1-based coordinate system
    • Numbers nucleotides directly
  • 0-based coordinate system
    • Numbers between nucleotides
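The two numbering schemes can also be drawn as text. The following is a minimal Python sketch; the seven-base sequence is made up for illustration, and `rulers` is a hypothetical helper name, not part of any tool mentioned here.

```python
def rulers(seq):
    """Return a three-line text diagram of both coordinate systems."""
    # 1-based: each number sits directly above its nucleotide.
    one_based = "".join(f"  {i:<2}" for i in range(1, len(seq) + 1)).rstrip()
    bases = "".join(f"  {b:<2}" for b in seq).rstrip()
    # 0-based: each number sits on a boundary between nucleotides.
    zero_based = "".join(f"{i:<4}" for i in range(len(seq) + 1)).rstrip()
    return "\n".join([one_based, bases, zero_based])


# Imaginary sequence, assumed for illustration only.
print(rulers("ACGTTCA"))
```

The top row labels the nucleotides themselves (1-based); the bottom row labels the gaps before, between, and after them (0-based), which is why it has one more label than there are bases.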

To indicate a single nucleotide or variant:

  • 1-based coordinate system
    • Single nucleotides, variant positions, or ranges are specified directly by their corresponding nucleotide numbers
  • 0-based coordinate system
    • Single nucleotides, variant positions, or ranges are specified by the coordinates that flank them
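As a rough illustration of the two conventions, here is a Python sketch. Python's own slicing happens to follow the 0-based, flanking-coordinate convention. The sequence and the function names are assumptions made for this example.

```python
SEQ = "ACGTTCA"  # imaginary first seven bases, assumed for illustration


def fetch_one_based(seq, start, end):
    """Return bases start..end, both named directly (1-based, inclusive)."""
    return seq[start - 1:end]


def fetch_zero_based(seq, start, end):
    """Return the bases lying between boundary coordinates start and end."""
    return seq[start:end]


# The fifth nucleotide: position 5 in 1-based, flanked by 4 and 5 in 0-based.
print(fetch_one_based(SEQ, 5, 5), fetch_zero_based(SEQ, 4, 5))
# Bases two through four: 2-4 in 1-based, flanked by 1 and 4 in 0-based.
print(fetch_one_based(SEQ, 2, 4), fetch_zero_based(SEQ, 1, 4))
```

In both cases the two calls return the same bases; only the way the position is written differs.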

To indicate a deletion or insertion:

  • 1-based coordinate system
    • Deletions are specified directly by the positions of the deleted bases
    • Insertions are indicated by the coordinates of the bases that flank the insertion
  • 0-based coordinate system
    • Deletions are specified by the coordinates that flank the deleted bases
    • Insertions are indicated directly by the coordinate position where the insertion occurs
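The four cases above can be sketched in Python by applying each kind of variant to a sequence. This follows the conventions described in this post (1-based insertions given by their two flanking bases); the sequence and function names are illustrative assumptions.

```python
SEQ = "ACGTTCA"  # imaginary sequence, assumed for illustration


def delete_one_based(seq, start, end):
    """Remove bases start..end, named directly (1-based, inclusive)."""
    return seq[:start - 1] + seq[end:]


def delete_zero_based(seq, start, end):
    """Remove the bases flanked by boundary coordinates start and end."""
    return seq[:start] + seq[end:]


def insert_one_based(seq, left, right, bases):
    """Insert between the flanking bases left and right (right == left + 1)."""
    if right != left + 1:
        raise ValueError("flanking bases must be adjacent")
    return seq[:left] + bases + seq[left:]


def insert_zero_based(seq, pos, bases):
    """Insert at the boundary coordinate pos (start == end == pos)."""
    return seq[:pos] + bases + seq[pos:]


# Deleting the fifth base: 5-5 in 1-based, 4-5 in 0-based.
print(delete_one_based(SEQ, 5, 5), delete_zero_based(SEQ, 4, 5))
# Inserting G after the fourth base: 4-5 in 1-based, 4-4 in 0-based.
print(insert_one_based(SEQ, 4, 5, "G"), insert_zero_based(SEQ, 4, "G"))
```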

Why does all this matter?

Pseudo-code to convert variants from 0-based coordinates to 1-based coordinates:

if (type == SNV) { start = start + 1; end = end; }
if (type == DEL) { start = start + 1; end = end; }
if (type == INS) { start = start; end = end + 1; }

Pseudo-code to convert variants from 1-based coordinates to 0-based coordinates:

if (type == SNV) { start = start - 1; end = end; }
if (type == DEL) { start = start - 1; end = end; }
if (type == INS) { start = start; end = end - 1; }
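For anyone who wants something runnable, the pseudo-code above translates to Python roughly as follows. This is a sketch of the convention used in this post only (1-based insertions written as the two flanking bases); the function names are made up here and are not taken from the convert_zero_one_based tool.

```python
def zero_to_one(kind, start, end):
    """Convert a 0-based variant to 1-based, per the pseudo-code above."""
    if kind in ("SNV", "DEL"):
        return start + 1, end
    if kind == "INS":
        return start, end + 1
    raise ValueError(f"unknown variant type: {kind}")


def one_to_zero(kind, start, end):
    """Convert a 1-based variant to 0-based, per the pseudo-code above."""
    if kind in ("SNV", "DEL"):
        return start - 1, end
    if kind == "INS":
        return start, end - 1
    raise ValueError(f"unknown variant type: {kind}")


print(zero_to_one("SNV", 4, 5))  # the fifth base
print(zero_to_one("DEL", 4, 7))  # bases five through seven
print(zero_to_one("INS", 4, 4))  # between the fourth and fifth bases
```

Note that the conversion depends on the variant type, so the type has to be known before the coordinates can be converted.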

GBrowse and GFF are 1-based exclusively.

written 3.4 years ago by scott

I'm not sure your description of an insertion in 1-based coordinates is canonical. I'm actually not sure there is a canonical description of insertions as a 1-based coordinate range, which is why many low-level tools prefer to work in 0-based coordinates.

One of the primary advantages of 0-based coordinates is that the width of a feature is always 'end-start'. Insertions have 0-width, and hence they have the same start/end.

Your representation in 1-based coordinates makes the insertion appear as if it has a width of 2! I'm not sure there is a better representation, but a more programmatically friendly one is to describe it as chr1:5-4 (yes, start > end). That way the width of the feature is always (end-start+1), and translation from 0-based to 1-based only requires incrementing the 'start' for ALL variant types.

Because of these problems, I would argue that variants should not be represented in 1-based coordinates as a range, but by the start coordinate only.
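The representation argued for in this comment can be sketched in a few lines of Python. The function names are illustrative assumptions; the arithmetic is the one stated above (width is end-start in 0-based and end-start+1 in 1-based, and only 'start' changes on conversion).

```python
def width_zero_based(start, end):
    """Width of a 0-based interval."""
    return end - start


def width_one_based(start, end):
    """Width of a 1-based interval, with start > end allowed for insertions."""
    return end - start + 1


def zero_to_one(start, end):
    """Type-agnostic conversion: only the start moves."""
    return start + 1, end


def one_to_zero(start, end):
    """Inverse of zero_to_one; no variant type needed."""
    return start - 1, end


# A 0-based insertion at 4-4 becomes chr1:5-4 in this 1-based form.
ins = zero_to_one(4, 4)
print(ins, width_one_based(*ins))
```

The insertion keeps a width of zero in both systems, and the round trip back to 0-based does not need to know what kind of variant it is handling.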

written 3.4 years ago by Gabe Rudy

The GFF3 spec indicates that insertion sites should have start = end, with the insertion occurring to the right of that coordinate. It's a little goofy but at least it's a "standard".

I agree with regard to zero-based coordinates: both Chado and JBrowse use zero-based coordinates internally.

written 3.4 years ago by scott

Ensembl represents an INS with start > end: http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcf

The following examples illustrate how VCF describes a variant and how it is handled internally by VEP. Consider the following aligned sequences (for the purposes of discussion on chromosome 20):

Ref: a t C g a // C is the reference base
1 : a t G g a // C base is a G in individual 1
2 : a t - g a // C base is deleted w.r.t. the reference in individual 2
3 : a t CAg a // A base is inserted w.r.t. the reference sequence in individual 3

Individual 3

The third individual has an "A" inserted between the 3rd and 4th bases of the sequence relative to the reference. In VCF, as for the deletion, the base before the insertion is included in both the reference and variant allele columns, and the reported position is that of the preceding base:

20   3   .   C   CA   .   PASS   .

In Ensembl format, again the preceding base is not included, and the start/end positions are "swapped" to indicate that this is an insertion. Similarly to a deletion, a "-" is used to indicate no sequence in the reference:

 20   4   3   -/A   +

Again, the output will appear different, and the constructed identifier may not be what is expected:

20_3_-/A

The solution is to always add a unique identifier for each of your variants to the VCF file, or use VCF as your output format.
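The quoted conversion can be sketched in Python. This is not VEP's own code; it is a minimal illustration that assumes a single ALT allele and, for indels, a shared leading base as in the quoted example. The function name and the tab-separated output are choices made here.

```python
def vcf_to_ensembl(chrom, pos, ref, alt, strand="+"):
    """Turn one simple VCF record into an Ensembl-default-style line."""
    if len(ref) == 1 and len(alt) == 1:
        # Plain substitution: start and end are both the VCF position.
        start, end = pos, pos
    else:
        if ref[0] != alt[0]:
            raise ValueError("expected a shared leading base for indels")
        # Drop the preceding base that VCF includes in both alleles.
        ref, alt = ref[1:], alt[1:]
        start = pos + 1
        # For an insertion the trimmed ref is empty, so end == start - 1.
        end = pos + len(ref)
    alleles = f"{ref or '-'}/{alt or '-'}"
    return "\t".join([str(chrom), str(start), str(end), alleles, strand])


# Individual 3 from the quoted text: VCF "20 3 . C CA".
print(vcf_to_ensembl("20", 3, "C", "CA"))
```

For individual 3 this gives start 4 and end 3 with alleles -/A, matching the "swapped" start/end shown in the quoted Ensembl line.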

modified 3 months ago • written 3 months ago by Pablo Marin-Garcia

Yes. Most of what you say is absolutely true. Except I'm not sure about your last point. How would you indicate a multi-bp deletion in 1-based with only a start coordinate? In any case, the point of this cheat sheet was not to propose a standard but merely to describe the real-world problem. As someone working at a genome center and dealing with a large variety of standard and non-standard variant files, I quite commonly see both 1-based and 0-based variant files in chr:start-stop format, whether that makes sense or not. The main thing is to realize that you need to think about this issue before blindly passing a variant file to any piece of software.

modified 3.4 years ago • written 3.4 years ago by Obi Griffith

Completely agree that there is under-appreciated complexity in the different (and sometimes conflicting) representations used by different tools.

My last point is simply this: a pair of 1-based start/stop indexes is inadequate to describe variants as "features" (because insertions have zero width). Yet of course, we are sometimes faced with the need to attempt to do so, and I actually prefer the (start+1,end) representation precisely because it gives the impression that you are representing something unnatural in 1-based coordinates when you see chr1:5-4.

It can also be round-trip converted back to 0-based coordinates without knowing what type of variant the coordinates represent, as it is variant-type agnostic.

But for human-readable output, 1-based is preferred, and my preference is then to write variants with just the 1-based start, using something akin (or equivalent) to HGVS g. notation.

For example:

  • chr1:5insT (insertion of a T between positions 4 and 5)
  • chr1:5C>T (change of reference C to T)
  • chr1:5delTTC (3bp deletion of TTC)

When representing variants in a format intended for programs to parse though, 0-based intervals are always preferred.
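To connect the two halves of this comment, here is a Python sketch that keeps variants as 0-based intervals internally and only produces the 1-based, start-only labels for display. The function name is hypothetical, and the labels follow the informal examples above rather than strict HGVS (which, for instance, writes an insertion with both flanking positions).

```python
def describe(chrom, start, end, ref, alt):
    """Build a short 1-based label from a 0-based interval and its alleles."""
    pos = start + 1  # the same increment for every variant type
    if start == end and alt:
        return f"{chrom}:{pos}ins{alt}"  # zero-width interval: insertion
    if not alt:
        return f"{chrom}:{pos}del{ref}"  # nothing replaces ref: deletion
    return f"{chrom}:{pos}{ref}>{alt}"   # substitution


print(describe("chr1", 4, 4, "", "T"))
print(describe("chr1", 4, 5, "C", "T"))
print(describe("chr1", 4, 7, "TTC", ""))
```

These three calls reproduce the three example labels listed above from 0-based intervals starting at 4.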

modified 3.4 years ago • written 3.4 years ago by Gabe Rudy

You write that UCSC is 0-based. As has been mentioned above, the web interface of the UCSC browser is 1-based, as it is meant to be used by biologists, but all internal representations (text files, database tables, binary files) are 0-based, as they're mostly used by programmers.

Unfortunately, for historical reasons, there are two exceptions: two formats (the wiggle and bigWig text file and database formats) are 1-based.

modified 4 months ago • written 4 months ago by Maximilian Haeussler

See https://genome.ucsc.edu/FAQ/FAQtracks#tracks1 ("Database/browser start coordinates differ by 1 base").
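In practice the difference described in that FAQ entry amounts to adding or subtracting 1 on the start only. A minimal Python sketch, with function names invented for this example:

```python
def bed_to_browser(chrom, chrom_start, chrom_end):
    """0-based table/BED-style coordinates -> 1-based browser position."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def browser_to_bed(position):
    """1-based browser position string -> (chrom, 0-based start, end)."""
    chrom, span = position.split(":")
    # Browser positions are often pasted with thousands separators.
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start - 1, end


print(bed_to_browser("chr1", 0, 7))
print(browser_to_bed("chr1:1-7"))
```

The end coordinate is the same number in both forms; only the start shifts by one.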

written 4 months ago by tangming2005

While BAM is 0-based, once you pull it out to something human-readable it can get turned into 1-based data (SAM). Great post, btw.

modified 3.4 years ago • written 3.4 years ago by Alex Reynolds

And some APIs are 1-based! E.g., the Java Picard library for BAM uses 1-based indexes :-)

written 3.4 years ago by Pierre Lindenbaum

For God's sake, check the TAIR10 GBrowse instance at http://tairvm17.tacc.utexas.edu/cgi-bin/gb2/gbrowse/arabidopsis/ ; it looks like a 0-based/1-based mixture to me (see the post "Help: TAIR10 GBrowser 1-based mixed with 0-based?"). Luckily we have IGV. (P.S. Plant research cannot triumph over mammalian research; just look at the UCSC Genome Browser.)

written 3.4 years ago by Puriney

I came back to this which I had bookmarked but all the images have disappeared. Any chance of bringing them back?

written 3.1 years ago by Daniel

Images are still there. Browser issues perhaps?

written 3.1 years ago by Obi Griffith

The images are blurred...

written 22 months ago by tangming2005

Something is wrong with the pseudo-code for converting INS between 0-based and 1-based. In a 0-based system, INS coordinates are usually given in the format [X,X), where X is a coordinate. In 0-based, an [X,X) INS means the sequence is inserted before X, and X itself is not included. A 1-based system has no such mechanism, so you have to convert to [X+1,X+1] to indicate that the sequence is inserted before X+1. Converting to [X,X+1], as you described above, makes no sense.

The same applies when converting from 1-based to 0-based.

So to convert INS from 0 to 1-based system:

start = start + 1; end = end + 1

and to convert INS from 1 to 0-based system:

start = start - 1; end = end - 1

For further discussion: INS is a very tricky case. Some programs do not recognize coordinates like [X,X) in 0-based. So there is another representation, [X,X+1), which also indicates that the sequence is inserted before coordinate X. In this situation, the conversion is as follows:

to convert INS from 0 to 1-based system:

start = start + 1; end = end

to convert INS from 1 to 0-based system:

start = start - 1; end = end

I vote for the second representation, which unifies the conversion method.

But I still cannot work out the case behind the INS conversion you gave above.
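The two INS spellings discussed in this comment can be put side by side in Python. This is only a sketch of the rules as stated here; the function names and the `empty_interval` flag are invented for the example.

```python
def ins_zero_to_one(start, end):
    """Convert a 0-based INS, written [X,X) or [X,X+1), to 1-based."""
    if end == start:          # [X,X): shift both coordinates
        return start + 1, end + 1
    if end == start + 1:      # [X,X+1): shift the start only
        return start + 1, end
    raise ValueError("not an insertion in either spelling")


def ins_one_to_zero(start, end, empty_interval=True):
    """Convert a 1-based INS [X+1,X+1] back to the chosen 0-based spelling."""
    if start != end:
        raise ValueError("expected start == end for a 1-based INS here")
    if empty_interval:
        return start - 1, end - 1   # back to [X,X)
    return start - 1, end           # back to [X,X+1)


print(ins_zero_to_one(4, 4), ins_zero_to_one(4, 5))
```

Both 0-based spellings land on the same 1-based pair, so going back requires knowing which 0-based spelling the downstream program expects.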

modified 20 months ago • written 20 months ago by Chen Sun

Just thought I'd add that the UCSC browser is 1-based, while its tools are 0-based.

written 5 months ago by tantrev

Incidentally, this makes no mention of zero-based, half-open coordinates, which are convenient. Complete Genomics used to use them. My first variant-processing programs used zero-based, closed coordinates, but I later found that half-open is easier to deal with. So BBMap's current variant format uses zero-based, half-open coordinates, as that seems to be the most efficient. 1-based formats always cause problems, so I consider it unfortunate that they were selected for SAM and VCF.
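A short Python sketch of why half-open intervals tend to be easier to work with than closed ones. The helper names are made up for this example and do not reflect BBMap's actual code or file format.

```python
def length_half_open(start, end):
    """Length of a zero-based, half-open interval [start, end)."""
    return end - start


def length_closed(start, end):
    """Length of a zero-based, closed interval [start, end]."""
    return end - start + 1


def closed_to_half_open(start, end):
    """Same bases, written with an exclusive end."""
    return start, end + 1


def split_half_open(start, end, at):
    """Split [start, end) at a boundary; the pieces share the cut point."""
    return (start, at), (at, end)


left, right = split_half_open(0, 7, 4)
print(left, right, length_half_open(*left) + length_half_open(*right))
# A zero-width insertion point is simply an empty interval.
print(length_half_open(4, 4))
```

With half-open coordinates, lengths need no +1, adjacent intervals share a coordinate, and a zero-width insertion needs no special case.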

modified 10 weeks ago • written 10 weeks ago by Brian Bushnell

Is it true that SAM is 1-based but BAM is 0-based? Why would it be designed this way?

written 1 day ago by -_-

BAM is zero-based: https://samtools.github.io/hts-specs/SAMv1.pdf

pos    0-based leftmost coordinate (= POS − 1)    int32_t

Why would it be designed this way?

I would say because C programmers like starting things at zero. Anyway, who cares? Unless you're writing your own BAM parser, you use an API to read the BAM data. For example, htsjdk's SAMRecord.getAlignmentStart() returns the 1-based POS.
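For the rare case where someone does read the raw bytes, the relationship quoted from the spec looks roughly like this in Python. It is a sketch that assumes `record` starts at the block_size field of an uncompressed BAM alignment record; the function name and the sample values are made up.

```python
import struct


def sam_pos_from_bam_record(record):
    """Return the 1-based SAM POS for a raw BAM alignment record."""
    # Little-endian fields: block_size (uint32), refID (int32), pos (int32).
    _block_size, _ref_id, pos0 = struct.unpack_from("<Iii", record, 0)
    # BAM stores the 0-based leftmost coordinate; SAM POS is one more.
    return pos0 + 1


# Made-up leading fields only, not a complete alignment record.
fake = struct.pack("<Iii", 0, 19, 2)
print(sam_pos_from_bam_record(fake))
```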

written 1 day ago by Pierre Lindenbaum

Written 16 months ago by Shicheng Guo:

It doesn't make sense to have two such coordinate systems. One unified system would be easier to use and would keep us from making mistakes.

written 16 months ago by Shicheng Guo

That ship sailed long ago (heck, even Fortran and C differ in whether to use 0- or 1-based indexing by default).

written 16 months ago by Devon Ryan

Yes, you are right. Perl and Python use 0 as the first array index. Anyway, we programmers need to remember all the traps. Maybe it is a natural barrier that stops non-programmer enemies from entering our field easily. LoL. Anyway, I like it.

written 16 months ago by Shicheng Guo