Question: How To Sort Bed Format File
7
5.3 years ago by
lyz10302012270
China
lyz10302012270 wrote:

Are there any tools or commands to sort a BED format file? The desired order is chr1, chr2, chr3, chr4, ..., chr22, chrX, chrY.

bed • 24k views
modified 8 months ago by Paul1.1k • written 5.3 years ago by lyz10302012270
14
4.8 years ago by
tflutre450
tflutre450 wrote:

GNU sort can now sort in alphanumeric ("natural") order. See the option "-V, --version-sort", available in "sort (GNU coreutils) 8.17".

Here is an example:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC"
chr10   1       2       A
chr2    1       2       B
chr1    1       2       C

And the result is:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC" | sort -k1,1V
chr1    1       2       C
chr2    1       2       B
chr10   1       2       A
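
A real BED file usually also needs the start coordinate sorted within each chromosome, and contains chrX and chrY. A minimal sketch extending the example above, assuming GNU sort with -V support and whitespace-separated columns (the function name sort_bed_natural is hypothetical, not part of any tool):

```shell
# sort_bed_natural is a hypothetical helper name, used only for this sketch.
# Key 1: chromosome in version ("natural") order; key 2: start, numeric.
sort_bed_natural() {
    sort -k1,1V -k2,2n "$@"
}

printf 'chrX\t5\t9\nchr10\t1\t2\nchr2\t7\t8\nchr2\t1\t2\nchrY\t1\t2\nchr1\t1\t2\n' |
    sort_bed_natural
# Output (tab-separated in the real output):
# chr1   1  2
# chr2   1  2
# chr2   7  8
# chr10  1  2
# chrX   5  9
# chrY   1  2
```

Letter-named chromosomes such as chrX and chrY land after the numbered ones because version sort orders digits before letters.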

thank you so much for this information !

written 4.8 years ago by Pierre Lindenbaum108k

what a nice side-effect

written 2.1 years ago by Michi930

what kind of side effect?

written 21 months ago by Dataman250
7
5.3 years ago by
lh330k
United States
lh330k wrote:

Use my version of GNU sort:

make
./sort -k1,1N -k2,2n unsrt.bed > srt.bed

It sorts chromosome names in alphanumeric order.

EDIT: -k1,1N is NOT a typo. It is a new sorting order - alphanumeric order - only available in my version of sort but not in the standard Unix sort. If you run:

echo chr10 chr5 chrX chr2 | tr " " "\n" | ./sort -N

You will get:

chr2
chr5
chr10
chrX
modified 5.3 years ago • written 5.3 years ago by lh330k

This returns the error sort: stray character in field spec: invalid field specification ‘1,1N’

written 5.3 years ago by Alex Reynolds24k
1

isn't this just a "N"->"n" typo?

written 5.3 years ago by daiefa123150

You need to use "./sort" or specify the full path (you can also rename it). Don't use the system sort. "N" is a new feature which is not implemented in the standard GNU sort.

modified 5.3 years ago • written 5.3 years ago by lh330k
6
5.3 years ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Let's say you have the following data:

$ more foo.bed
chr1    1       2
chr4    7       8
chrX    100     101
chr11   9       100
chr11   9       99
chr20   11      12
chr2    3       4
chr3    5       6

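You'd probably want to sort on the first column, then on the second and third columns. The chromosome key needs version sort (-V, GNU sort): a numeric key such as -k1,1n reads every "chrN" name as 0, so the chromosome column would not affect the order at all:

$ sort -k1,1V -k2,2n -k3,3n foo.bed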
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    99
chr11    9    100
chr20    11    12
chrX    100    101

Sorting only on the first and second columns does not guarantee the expected ordering where the start coordinates are equal. To demonstrate, look at the chr11 records below, where the stop coordinates are out of order (100 is greater than 99, but is printed first, because sort falls back to comparing the whole line as text when all keys tie):

$ sort -k1,1V -k2,2n foo.bed
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    100
chr11    9    99
chr20    11    12
chrX    100    101

In any case, note that neither of these two orderings would be the usual "correct" sort order that would go into BEDOPS tools (or now perhaps bedtools' merge operation), which work most efficiently and correctly with sort-bed ordering:

$ sort-bed foo.bed
chr1    1    2
chr11    9    99
chr11    9    100
chr2    3    4
chr20    11    12
chr3    5    6
chr4    7    8
chrX    100    101

Which sort you use depends on what tool or pipeline consumes the sorted output, and what ordering is expected.
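
For comparison, the lexicographic ordering shown in the sort-bed output above can be approximated with plain sort by forcing byte-wise comparison on the chromosome column. This is a sketch, not the sort-bed program itself; the function name sort_bed_lexical is hypothetical, and LC_ALL=C is set on the assumption that locale collation could otherwise alter the chromosome order:

```shell
# sort_bed_lexical is a hypothetical helper name, used only for this sketch.
# Key 1: chromosome, byte-wise (so chr11 sorts before chr2);
# keys 2 and 3: start and stop, numeric.
sort_bed_lexical() {
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$@"
}

printf 'chr1\t1\t2\nchr4\t7\t8\nchrX\t100\t101\nchr11\t9\t100\nchr11\t9\t99\nchr20\t11\t12\nchr2\t3\t4\nchr3\t5\t6\n' |
    sort_bed_lexical
# Output (tab-separated in the real output):
# chr1   1    2
# chr11  9    99
# chr11  9    100
# chr2   3    4
# chr20  11   12
# chr3   5    6
# chr4   7    8
# chrX   100  101
```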

modified 5.0 years ago • written 5.3 years ago by Alex Reynolds24k
3
5.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:

Use sed (with option -f) to transform your chromosome names into sortable strings, sort, and then transform the names back:

the original bed file:

$ cat input.bed
chr1    1    2
chr9    1    2
chr1    1    2
chr10    1    2
chr1    1    2
chr2    1    2
chr12    1    2
chr3    1    2

the sed file transforming the chrom to a sortable field (chr1->01 ; chr10 -> 10 ... )

$ cat sed1.sed 
s/^chr1    /01    /
s/^chr10    /10    /
s/^chr12    /12    /
s/^chr2    /02    /
s/^chr3    /03    /
s/^chr9    /09    /

the sed file for the reverse process: 01->chr1 ; 10->chr10

$ cat sed2.sed 
s/^01    /chr1    /
s/^10    /chr10    /
s/^12    /chr12    /
s/^02    /chr2    /
s/^03    /chr3    /
s/^09    /chr9    /

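all in one (the sort field delimiter is a literal tab, written here as $'\t' in bash; the second key is the start column, -k2,2n):

$ sed -f sed1.sed < input.bed | sort -t $'\t' -k1,1 -k2,2n | sed -f sed2.sed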
chr1    1    2
chr1    1    2
chr1    1    2
chr2    1    2
chr3    1    2
chr9    1    2
chr10    1    2
chr12    1    2
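
The two hand-written sed files can also be replaced by a generated key. A minimal sketch of the same decorate, sort, undecorate idea, assuming tab-separated input and chromosome names of the form chrN (the function name sort_bed_decorated and the three-digit padding are illustrative choices, not part of the answer above):

```shell
# sort_bed_decorated is a hypothetical helper name, used only for this sketch.
# Step 1 (awk): prepend a sortable key built from the chromosome name.
# Step 2 (sort): order by that key, then by start and stop.
# Step 3 (cut): drop the key again.
sort_bed_decorated() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             key = $1
             sub(/^chr/, "", key)
             # zero-pad numeric names; X, Y, M etc. sort after them as text
             if (key ~ /^[0-9]+$/) key = sprintf("%03d", key)
             print key, $0
         }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k3,3n -k4,4n |
    cut -f2-
}

printf 'chr1\t1\t2\nchr9\t1\t2\nchr10\t1\t2\nchr2\t1\t2\nchr12\t1\t2\nchr3\t1\t2\n' |
    sort_bed_decorated
# Chromosome order in the output: chr1 chr2 chr3 chr9 chr10 chr12
```

Because it does not rely on sort -V, this variant should also work with a sort that lacks version sort.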
modified 5.3 years ago • written 5.3 years ago by Pierre Lindenbaum108k
3
3.3 years ago by
Korsocius80
Korsocius80 wrote:
The easiest approach, which I use, is:

sort -V -k1,1 -k2,2 test.bed

This gives the chromosome order 1 2 3 4 5 6 7 8 9 10 ...
2
5.3 years ago by
nnutter200
St. Louis, MO
nnutter200 wrote:

I think JoinX does what you are requesting. It is a tool that The Genome Institute at Washington University has released to perform set operations on BED files and possibly other chr/pos-formatted files.

Here's an example:

$ cat input.bed 
chr1    1       2
chr9    1       2
chr1    1       2
chr10   1       2
chrX    1       2
chr1    1       2
chr2    1       2
chr12   1       2
chr3    1       2
$ joinx sort -i input.bed                                                       
chr1    1       2
chr1    1       2
chr1    1       2
chr2    1       2
chr3    1       2
chr9    1       2
chr10   1       2
chr12   1       2
chrX    1       2
modified 5.3 years ago • written 5.3 years ago by nnutter200
2
12 months ago by
rjactonspsfcf30 wrote:

Example Bed file:

chr3    1   20
chr4    20  30
chr1    2   10
chr3    10  15
chrX    20  30
chrY    20  40
chrX    5   25
chr3    10  40
chr1    1   9
chr10   20  30
chr22   12  24
chr11   10  90

Sort Command:

sort -k1,1V -k2,2n -k3,3n "file"

-k1,1V: Sort the 1st field in version (natural) order, so that chr10 sorts after chr9 rather than straight after chr1 (NB -V is available in GNU coreutils 8.17, as pointed out by @tflutre above)

-k2,2n: Sort 2nd field numerically, loci which start first in a chromosome come first

-k3,3n: Sort 3rd field numerically, loci which end first come first when they have the same start position.

Result:

chr1    1   9
chr1    2   10
chr3    1   20
chr3    10  15
chr3    10  40
chr4    20  30
chr10   20  30
chr11   10  90
chr22   12  24
chrX    5   25
chrX    20  30
chrY    20  40

To make this command available as a shorthand (e.g. sortbed) in your environment, edit ~/.bashrc and add:

# sortbed (or some other description)
alias sortbed="sort -k1,1V -k2,2n -k3,3n"

and run source ~/.bashrc

invoking:

sortbed "file"

should now have the same effect as the full command.
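
Before re-sorting a large file it can be useful to check whether it is already in this order. A minimal sketch, assuming GNU sort with -V support; the function name is_bed_sorted is hypothetical, and it relies on sort's -C option, which checks the order silently and reports the result through the exit status:

```shell
# is_bed_sorted is a hypothetical helper name, used only for this sketch.
# Exit status 0: the file already follows the -k1,1V -k2,2n -k3,3n ordering.
# Non-zero exit status: at least one line is out of order.
is_bed_sorted() {
    sort -C -k1,1V -k2,2n -k3,3n "$1"
}
```

For example, is_bed_sorted "file" && echo "already sorted" prints the message only when no re-sort is needed.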


I don't think it is necessary to sort by -k3,3n... It follows from the definition of a BED file.

written 12 months ago by Korsocius80

This only crops up when you have a bed file with overlapping features e.g.

chr1 1 10
chr1 1 5

would not be ordered as:

chr1 1 5
chr1 1 10

without sorting on -k3,3n

written 12 months ago by rjactonspsfcf30

Yep, see: A: How To Sort Bed Format File

written 12 months ago by Alex Reynolds24k
1
9 months ago by
chikaharu10
Japan
chikaharu10 wrote:

A simpler way:

sort -Vk1,2

The standard OS X sort command probably does not support the -V option, so please use GNU sort.
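
One way to act on this is to test at run time whether the available sort accepts -V. A minimal sketch; pick_version_sort is a hypothetical name, and gsort is an assumption (GNU sort installed under a g prefix, as Homebrew's coreutils package does):

```shell
# pick_version_sort is a hypothetical helper name, used only for this sketch.
# Prints the name of the first command that accepts -V, trying the system
# sort first and then gsort (assumed to be a g-prefixed GNU sort).
pick_version_sort() {
    for cmd in sort gsort; do
        if command -v "$cmd" >/dev/null 2>&1 &&
           printf 'chr10\nchr2\n' | "$cmd" -V >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}
```

It could then be used as "$(pick_version_sort)" -Vk1,2 input.bed, where input.bed stands for your own file.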

0
2.6 years ago by
Iceland
rhileighalmgren20 wrote:

To make a BED file match any genome hosted by UCSC, download or make the chrom.sizes file for that genome (e.g. with the Kent tools fetchChromSizes script, or from e.g. https://genome.ucsc.edu/goldenpath/help/hg19.chrom.sizes).

Then:

/me/tools/htslib/1.1/bin/bgzip --stdout ${inputBase}.noheader.noOverlap.bed >${inputBase}.noheader.noOverlap.bed.gz

/me/tools/htslib/1.1/bin/tabix -p bed ${inputBase}.noheader.noOverlap.bed.gz

cat ${chromsizes} | cut -f1 | xargs /me/tools/htslib/1.1/bin/tabix -h ${inputBase}.noheader.noOverlap.bed.gz > ${outputBase}.bed
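
The same chrom.sizes ordering can also be sketched without bgzip and tabix, using only awk, sort and cut. This is an alternative illustration rather than the method above; the function name sort_bed_by_chromsizes is hypothetical, and both files are assumed to be tab-separated:

```shell
# sort_bed_by_chromsizes is a hypothetical helper name, used only for this sketch.
# Argument 1: chrom.sizes file (chromosome name in column 1).
# Argument 2: BED file.
# Each BED line is prefixed with the rank of its chromosome in chrom.sizes,
# sorted by rank, start and stop, and the rank is removed again.
sort_bed_by_chromsizes() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { rank[$1] = FNR; next }   # first file: chrom.sizes
         ($1 in rank) { print rank[$1], $0 }  # chromosomes not listed are dropped
        ' "$1" "$2" |
    sort -t "$(printf '\t')" -k1,1n -k3,3n -k4,4n |
    cut -f2-
}
```

Usage would be sort_bed_by_chromsizes "${chromsizes}" input.bed > sorted.bed, with input.bed and sorted.bed standing for your own file names.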

0
2.5 years ago by
nurithec0
nurithec0 wrote:

Try this function. It was written in R for Arabidopsis, so the number of chromosomes is 5, but you can change it.

order.data = function (bed, n.chr = 5) # BED-like data; chr coded as 1..n.chr
{
  # ordered data (od), built up one chromosome at a time
  od = NULL
  for (i in 1:n.chr)
  {
    # rows of chromosome i, ordered by numeric start position
    chr.bed = bed[bed[, "chr"]==i,]
    o = order (as.integer(as.character(chr.bed[,"start"])))
    od = rbind (od, chr.bed[o, ])
  }
  # return the ordered data
  od
}

 

modified 2.5 years ago • written 2.5 years ago by nurithec0
0
8 months ago by
Paul1.1k
European Union
Paul1.1k wrote:

I simply use:

sort -k 1V,1 -k 2n,2 input.bed