Official BED vs. GFF Coordinate Conventions?
8.6 years ago
user ▴ 870

What are the official differences between the start/end coordinate conventions in BED and GFF? The BED format description here says that the start coordinate is 0-based but does not say which convention the end coordinate follows. GFF is definitely 1-based and is generally fully closed. Is BED half-open (closed at the start, open at the end)? So:

GFF: chr 1 100

would be in BED: chr 0 100

That means that if we want a 50 bp interval beginning at start coordinate x, in GFF it's [x, x+50-1], but in BED it's [x, x+50). The length computations differ in the same way: end - start + 1 in GFF versus end - start in BED. Is this all an official standard, or just a convention that most people adhere to? Can you count on no tool ever outputting half-open GFF coordinates?
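To make the arithmetic concrete, here is a minimal Python sketch of the two conventions as I understand them. The function names are made up for illustration and do not come from any library.

```python
def gff_interval(start, length):
    """1-based, fully closed: the returned end base IS part of the feature."""
    return start, start + length - 1


def bed_interval(chrom_start, length):
    """0-based, half-open: the returned chromEnd is NOT part of the feature."""
    return chrom_start, chrom_start + length


def gff_length(start, end):
    # Both endpoints are included, hence the +1.
    return end - start + 1


def bed_length(chrom_start, chrom_end):
    # The end is excluded, so a plain difference gives the length.
    return chrom_end - chrom_start


# The same 50 bp at the very beginning of a chromosome in each convention.
print(gff_interval(1, 50))
print(bed_interval(0, 50))
```

Under these assumptions, "chr 1 100" in GFF and "chr 0 100" in BED both describe 100 bases.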

coordinates genomics genome bed gff3 gff • 6.1k views
8.6 years ago

Is this all official standard or just convention that most people adhere to?

This is in the specification for GFF3:

Columns 4 & 5: "start" and "end"

The start and end coordinates of the feature are given in positive 1-based integer coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature.

For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
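The origin-crossing rule quoted above can be sketched in Python roughly as follows. The function name and the 5000 bp landmark in the example are hypothetical; only the arithmetic is taken from the spec text.

```python
def gff3_circular_coords(start, end, landmark_length):
    """Return GFF3 (start, end) for a feature on a circular landmark.

    start and end are 1-based positions on the landmark. If the feature
    wraps past the origin (end < start), the landmark length is added
    to end so that start <= end still holds, as the spec requires.
    """
    if not (1 <= start <= landmark_length and 1 <= end <= landmark_length):
        raise ValueError("positions must lie within the landmark")
    if end < start:
        end += landmark_length
    return start, end


# A feature running from base 4900, through the origin, to base 100
# on a hypothetical 5000 bp plasmid.
print(gff3_circular_coords(4900, 100, 5000))
```

With the usual closed-interval length (end - start + 1), the wrapped feature in this example comes out at 201 bases: 101 before the origin and 100 after it.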

As for BED, the 0-based start and the excluded end are described in the UCSC documentation here:

chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

And here:

If you submit data to the browser in position format (chr#:##-##), the browser assumes this information is 1-based. If you submit data in any other format (BED (chr# ## ##) or otherwise), the browser will assume it is 0-based. You can see this both in our liftOver utility and in our search bar, by entering the same numbers in position or BED format and observing the results. Similarly, any data returned by the browser in position format is 1-based, while data returned in BED, wiggle, etc is 0-based.
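As an illustration of the two UCSC passages above, here is a small Python sketch that converts between a 1-based position string and 0-based, half-open BED fields. The helper names are invented for this example, and accepting thousands separators in the numbers is my own assumption about what input you might see.

```python
import re

# chrom:start-end, with optional commas in the numbers (assumption).
_POSITION = re.compile(r"^(\S+):([\d,]+)-([\d,]+)$")


def position_to_bed(position):
    """Turn a 1-based, closed 'chr:start-end' string into BED fields."""
    match = _POSITION.match(position.strip())
    if match is None:
        raise ValueError(f"not a position string: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    # Shift only the start; the closed 1-based end equals the open 0-based end.
    return chrom, start - 1, end


def bed_to_position(chrom, chrom_start, chrom_end):
    """Turn 0-based, half-open BED fields into a 1-based position string."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


print(position_to_bed("chr1:1-100"))
print(bed_to_position("chr1", 0, 100))
```

The first 100 bases of a chromosome are therefore chr1:1-100 in position format and chr1, 0, 100 in BED.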

These are the specifications we follow for our GFF3-to-BED and other conversion utility scripts. But convention is whatever people use, and labs are known to do their own thing (as we found out with GFF3, as it happens, which broke one of our analysis pipelines). You can write tools that rely on conventions, but nothing beats healthy skepticism about how standards are interpreted and judicious use of debugging tools when there is data "smell".
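As a rough idea of what such a skeptical check might look like, here is a Python sketch that flags coordinates inconsistent with each convention. It is not taken from our scripts; the function names and the optional chrom_size argument are hypothetical.

```python
def check_gff_coords(start, end):
    """Return a list of problems for a 1-based, fully closed interval."""
    problems = []
    if start < 1:
        # A start of 0 in a GFF file often means BED-style data leaked in.
        problems.append("start < 1: GFF3 coordinates are 1-based")
    if start > end:
        problems.append("start > end")
    return problems


def check_bed_coords(chrom_start, chrom_end, chrom_size=None):
    """Return a list of problems for a 0-based, half-open interval."""
    problems = []
    if chrom_start < 0:
        problems.append("chromStart < 0")
    if chrom_start > chrom_end:
        problems.append("chromStart > chromEnd")
    if chrom_size is not None and chrom_end > chrom_size:
        # chromEnd may equal the chromosome size, but not exceed it.
        problems.append("chromEnd beyond chromosome size")
    return problems


print(check_gff_coords(0, 100))
print(check_bed_coords(0, 101, chrom_size=100))
```

Checks like these cannot prove which convention a file uses, but a start of 0 in a supposed GFF file, or an end one past the chromosome size, is a strong hint that something upstream got the conversion wrong.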

8.6 years ago
KCC ★ 4.0k

This has been asked before, I think. BED is 0-based and half-open, so "chr 1 100" in a GFF file is "chr 0 100" in BED: subtract 1 from the start and leave the end unchanged.
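A minimal Python sketch of that conversion on a single tab-separated GFF3 line, assuming the standard nine columns. The function name is made up, and comment or directive lines are not handled.

```python
def gff3_line_to_bed3(line):
    """Convert one GFF3 feature line to a three-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid = fields[0]
    start = int(fields[3])  # column 4, 1-based and included
    end = int(fields[4])    # column 5, 1-based and included
    # Only the start shifts; the closed end equals the half-open BED end.
    return f"{seqid}\t{start - 1}\t{end}"


gff_line = "chr\t.\tgene\t1\t100\t.\t+\t.\tID=g1"
print(gff3_line_to_bed3(gff_line))
```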

