Bed Coordinates
11.3 years ago
Florianino ▴ 30

Hi all,

I have installed bedtools and tried fastaFromBed, but when I ask for positions 1 to 25, the output contains positions 2 to 25 instead. How come?

I had posted that as a comment and got a first reply:

"BED format uses zero-based, half-open coordinates, so the first 25 bases of a sequence are in the range 0-25 (those bases being numbered 0 to 24). – Keith James♦ Mar 12 at 16:33"

So BED coordinates are different from GFF3 for example? How to confidently reformat columns of start-stop intervals before extracting coordinates using BEDtools?


You may want to see this related question on the pros/cons of different coordinate systems: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems


11.3 years ago

So BED coordinates are different from GFF3 for example?

Yes, there is a +/-1 shift. See http://genome.ucsc.edu/FAQ/FAQformat.html#format1

chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
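To make the definition above concrete, here is a minimal sketch (not from the FAQ) that prints the 1-based, inclusive range and the length covered by a BED interval. The function name `bed_to_one_based` is hypothetical and used only for illustration.

```shell
# Hypothetical helper: given chrom, chromStart, chromEnd (BED, zero-based,
# half-open), print chrom, the 1-based inclusive start and end, and the length.
bed_to_one_based() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3" |
        awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3, $3 - $2 }'
}

# The FAQ example: chromStart=0, chromEnd=100 covers 1-based bases 1..100.
bed_to_one_based chr1 0 100
```

Only the start shifts by one; the end keeps the same number in both systems, and the length is simply chromEnd minus chromStart.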

As for

How to confidently reformat columns of start-stop intervals before extracting coordinates using BEDtools?

You can simply use awk. For example:

echo -e "chr1\t1\t100" | awk '{printf("%s\t%d\t%d\n",$1,int($2)-1,int($3));}'
chr1    0   100
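The same idea can be applied to a whole GFF3 file. The following is a rough sketch, assuming standard tab-separated GFF3 with the sequence ID in column 1, the start in column 4 and the end in column 5; `gff3_to_bed` is a hypothetical name, and only a 3-column BED is produced.

```shell
# Hypothetical helper: read GFF3 on stdin, write 3-column BED on stdout.
# GFF3 is 1-based and inclusive, so only the start is shifted by -1.
gff3_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 5 { next }          # skip comments, directives, short lines
         { print $1, $4 - 1, $5 }'
}

# One feature spanning 1-based positions 1..25 becomes BED 0..25.
printf '##gff-version 3\nchr1\t.\tgene\t1\t25\t.\t+\t.\tID=gene1\n' | gff3_to_bed
```

In practice you would feed it a file with something like `gff3_to_bed < annotations.gff3 > annotations.bed`, where both file names are placeholders.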

10.9 years ago
Rlong ▴ 340

I have found it useful to think of BED coordinates as marking the spaces between the bases, rather than the bases themselves. I will try to represent this:

| A | C | G | T | A | C | G | T |
0   1   2   3   4   5   6   7   8

So if you wanted to describe the first base, it would be:

chr    0    1

and GTAC:

chr    2    6

Another handy thing to note: you should always be able to subtract the start from the end to get the number of bases you are describing. The one special case is an insertion, which is the only case where you should have start == stop. This makes sense in this scheme, since you are really only calling out a position between two bases, where a bit of sequence has been inserted.
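As a quick check of this picture, here is a small sketch that pulls the bases covered by a BED interval out of a sequence string. `bed_slice` is a hypothetical helper; because awk's `substr()` counts from 1, the start is shifted by one, and the length passed to it is end minus start.

```shell
# Hypothetical helper: print the bases of a sequence string covered by a
# BED interval (start, end). substr() is 1-based, hence start + 1.
bed_slice() {
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN { print substr(seq, s + 1, e - s) }'
}

bed_slice ACGTACGT 0 1   # the first base, A
bed_slice ACGTACGT 2 6   # GTAC
bed_slice ACGTACGT 4 4   # start == end: an empty line, no bases covered
```

The last call shows the insertion case: the interval names a position between two bases, so it covers nothing.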