How To Sort Bed Format File
12
10
Entering edit mode
11.6 years ago
lyz10302012 ▴ 470

Is there any tools or commands to sort bed format file? The correct order is chr1, chr2, chr3, chr4,..., chr22, chrX, chrY.

bed • 63k views
ADD COMMENT
17
Entering edit mode
11.0 years ago
tflutre ▴ 580

GNU sort can now sort in alpha-numeric order. See option -V, --version-sort of the following version "sort (GNU coreutils) 8.17".

Here is an example:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC"
chr10   1       2       A
chr2    1       2       B
chr1    1       2       C

And the result is:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC" | sort -k1,1V
chr1    1       2       C
chr2    1       2       B
chr10   1       2       A
ADD COMMENT
0
Entering edit mode

thank you so much for this information !

ADD REPLY
0
Entering edit mode

what a nice side-effect

ADD REPLY
0
Entering edit mode

what kind of side effect?

ADD REPLY
15
Entering edit mode
9.6 years ago
Korsocius ▴ 250

The easiest what I am using is :

sort -V -k1,1 -k2,2 test.bed

And you have the order of chrom:

1 2 3 4 5 6 7 8 9 10 ....
ADD COMMENT
0
Entering edit mode

what does k1 and k2 means?

ADD REPLY
11
Entering edit mode
7.3 years ago
rjactonspsfcf ▴ 180

Example Bed file:

chr3    1   20
chr4    20  30
chr1    2   10
chr3    10  15
chrX    20  30
chrY    20  40
chrX    5   25
chr3    10  40
chr1    1   9
chr10   20  30
chr22   12  24
chr11   10  90

Sort Command:

sort -k1,1V -k2,2n -k3,3n "file"
  • -k1,1V: Sort 1st field alphabetically, recognising the 10 (NB -V is available in GNU coreutils >8.17 as pointed out by @tflutre above)
  • -k2,2n: Sort 2nd field numerically, loci which start first in a chromosome come first
  • -k3,3n: Sort 3rd field numerically, loci which end first come first when they have the same start position.

Result:

chr1    1   9
chr1    2   10
chr3    1   20
chr3    10  15
chr3    10  40
chr4    20  30
chr10   20  30
chr11   10  90
chr22   12  24
chrX    5   25
chrX    20  30
chrY    20  40

To make this command available as a short hand e.g. sortbed in your environment, edit ~/.bashrc and add:

# sortbed (or some other description)
alias sortbed="sort -k1,1V -k2,2n -k3,3n"

and run source ~/.bashrc

invoking:

sortbed "file"

should now have the same effect as the full command.

ADD COMMENT
0
Entering edit mode

I think it is not necessary to sort by -k3,3n... It is logical from definition of bed file.

ADD REPLY
0
Entering edit mode

This only crops up when you have a bed file with overlapping features e.g.

chr1 1 10
chr1 1 5

would not be ordered as:

chr1 1 5
chr1 1 10

without sorting on -k3,3n

ADD REPLY
0
Entering edit mode
ADD REPLY
10
Entering edit mode
11.6 years ago
lh3 33k

Use my version of GNU sort:

make
./sort -k1,1N -k2,2n unsrt.bed > srt.bed

It sorts chromosome names to the alpha-numeric order.

EDIT: -k1,1N is NOT a typo. It is a new sorting order - alphanumeric order - only available in my version of sort but not in the standard Unix sort. If you run:

echo chr10 chr5 chrX chr2 | tr " " "\n" | ./sort -N

You will get:

chr2
chr5
chr10
chrX
ADD COMMENT
0
Entering edit mode

This returns the error

sort: stray character in field spec: invalid field specification ā€˜1,1Nā€™
ADD REPLY
1
Entering edit mode

isn't this just a "N"->"n" typo?

ADD REPLY
0
Entering edit mode

You need to use ./sort or specify the full path (you can also rename it). Don't use the system sort. "N" is a new feature which is not implemented in the standard GNU sort.

ADD REPLY
8
Entering edit mode
11.6 years ago

Let's say you have the following data:

$ more foo.bed
chr1    1       2
chr4    7       8
chrX    100     101
chr11   9       100
chr11   9       99
chr20   11      12
chr2    3       4
chr3    5       6

You'd probably want to sort on the first column, then sort on both the second and third columns:

$ sort -k1,1 -k2,2n -k3,3n foo.bed
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    99
chr11    9    100
chr20    11    12
chrX    100    101

Sorting only on the first and second columns may not guarantee expected ordering, where the start coordinates are equal. To demonstrate, take a look at the chr11 records below, where the stop coordinates are presumably out-of-order (100 is greater than 99, but is printed first with this use of UNIX sort):

$ sort -k1,1 -k2,2n foo.bed
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    100
chr11    9    99
chr20    11    12
chrX    100    101

In any case, note that neither of these two orderings would be the usual "correct" sort order that would go into BEDOPS tools (or now perhaps bedtools' merge operation), which work most efficiently and correctly with sort-bed ordering:

$ sort-bed foo.bed
chr1    1    2
chr11    9    99
chr11    9    100
chr2    3    4
chr20    11    12
chr3    5    6
chr4    7    8
chrX    100    101

Which sort you use depends on what tool or pipeline consumes the sorted output, and what ordering is expected.

ADD COMMENT
3
Entering edit mode
11.6 years ago

Use sed (with option -f) to transform your chromosome names into a set of sortable strings, sort and retransform the names:

the original bed file:

$ cat input.bed
chr1    1    2
chr9    1    2
chr1    1    2
chr10    1    2
chr1    1    2
chr2    1    2
chr12    1    2
chr3    1    2

the sed file transforming the chrom to a sortable field (chr1 -> 01 ; chr10 -> 10 ... )

$ cat sed1.sed 
s/^chr1    /01    /
s/^chr10    /10    /
s/^chr12    /12    /
s/^chr2    /02    /
s/^chr3    /03    /
s/^chr9    /09    /

the sed file for the reverse process: 01 -> chr1 ; 10 -> chr10

$ cat sed2.sed 
s/^01    /chr1    /
s/^10    /chr10    /
s/^12    /chr12    /
s/^02    /chr2    /
s/^03    /chr3    /
s/^09    /chr9    /

all in one:

$ sed -f sed1.sed < input.bed | sort  -t '     ' -k1,1 -k2,12n | sed -f sed2.sed 
chr1    1    2
chr1    1    2
chr1    1    2
chr2    1    2
chr3    1    2
chr9    1    2
chr10    1    2
chr12    1    2
ADD COMMENT
2
Entering edit mode
11.6 years ago
nnutter ▴ 210

I think JoinX does what you are requesting. It is a tool that The Genome Institute at Washington University has released to perform set operations on BED files and maybe other chr/pos formatted files.

Here's an example:

$ cat input.bed 
chr1    1       2
chr9    1       2
chr1    1       2
chr10   1       2
chrX    1       2
chr1    1       2
chr2    1       2
chr12   1       2
chr3    1       2
$ joinx sort -i input.bed                                                       
chr1    1       2
chr1    1       2
chr1    1       2
chr2    1       2
chr3    1       2
chr9    1       2
chr10   1       2
chr12   1       2
chrX    1       2
ADD COMMENT
1
Entering edit mode
8.7 years ago
nurithec ▴ 20

try this function. It was written it in R for Arabidopsis' so the number of chromosomes are 5' but you can change it.

order.data = function (bed) #BED like data
{
  o = order (bed[, "chr"])
  bed = bed[o,]
  #order data: od
  od = 1:ncol (bed)
  for (i in 1:5)
  {
    chr.bed = bed[bed[, "chr"]==i,]
    o = order (as.integer(as.character(chr.bed[,"start"])))
    od = rbind (od, chr.bed[o, ])
  }
  od = od [-1,]
  colnames (od)= colnames(bed)
  order.data = od
}
ADD COMMENT
1
Entering edit mode
7.0 years ago
chikaharu ▴ 10

More simply way

sort -Vk1,2

With the OSX standard sort command, it probably does not work with the V option, so please use GNU sort.

ADD COMMENT
1
Entering edit mode
6.9 years ago
Paul ★ 1.5k

I am using simply:

sort -k 1V,1 -k 2n,2 input.bed
ADD COMMENT
0
Entering edit mode
8.8 years ago

To make a bed file match any genome hosted by ucsc, download or make the chom.sizes file for that genome. (e.g. Kent tools fetchChromSizes script or e.g. https://genome.ucsc.edu/goldenpath/help/hg19.chrom.sizes )

Then:

/me/tools/htslib/1.1/bin/bgzip --stdout ${inputBase}.noheader.noOverlap.bed >${inputBase}.noheader.noOverlap.bed.gz
/me/tools/htslib/1.1/bin/tabix -p vcf ${inputBase}.noheader.noOverlap.bed.gz
cat ${chromsizes} | cut -f1 | xargs /me/tools/htslib/1.1/bin/tabix -h ${inputBase}.noheader.noOverlap.bed.gz > ${outputBase}.bed
ADD COMMENT

Login before adding your answer.

Traffic: 1897 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6