Question

bedops sort-bed inconsistencies

1

Entering edit mode

5.4 years ago

Ram 45k

Hi,

I'm working on uniquifying a BED file. I tried the version of bedops installed in my HPC, and then since it did not have an option to uniquify on the fly, I downloaded the latest version of bedops from GitHub. The numbers I see are different and I don't know why that's happening.

Note: The line counts below are not completely genuine, as I masked the first four digits (which were uniform in all wc outputs). The BED file is 7.5G in size and smaller numbers are easier to eyeball-compare.

wc -l file.bed
xxxx4305 

module load bedops #this loads bedops/2.4.2
sort-bed file.bed | uniq | wc -l
xxxx4305

sort -u file.bed | wc -l
xxxx3670

The above difference shows that sort-bed and sort work differently. The BED file is all numbers with no header, so I don't see why this difference should happen.

sort-bed from the current version (2.4.37) of bedops has a --unique option built in, so I installed that from github

/new/version/sort-bed --unique file.bed | wc -l
xxxx3035 #where did this number come from???

/new/version/sort-bed file.bed | uniq | wc -l
xxxx3670 #same as previous sort-bed | uniq combo

What could be happening here? Could the newer version be finding more duplicates because it uses memory more efficiently? Why is sort-bed | uniq not the same as sort -u?

I'll be getting raw output and running a diff to dig into what's happening here but in the meantime, I'd appreciate any pointers. Thank you!

bedops sort-bed • 1.7k views

ADD COMMENT • link updated 5.4 years ago by Alex Reynolds 36k • written 5.4 years ago by Ram 45k

score 1 · Answer 1 · 2020-03-06

1

Entering edit mode

5.4 years ago

Alex Reynolds 36k

Constants determine limits to line lengths, which could present an issue for use of default BEDOPS binaries if you have a 7.5G BED file with 4305 lines.

There is a so-called megarow build option for BEDOPS that compiles binaries that support lines up to 4M+ characters.

You could use make megarow to build this binary type: https://bedops.readthedocs.io/en/latest/content/installation.html#installation-via-source-code

This is useful for converting and sorting ultra-long reads from Nanopore and PacBio sequencing platforms to BED via bam2bed or convert2bed.

There may be some other issue or bug, but I suggest trying the megarow build with your dataset and see if you get similarly odd numbers.

ADD COMMENT • link 5.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Alex, the 4305 is dummy. There are 4 more digits that go before it, but they’re conserved across all wc operations, which is why I did not include them. The actual line count is >10,000 times the numbers shown here.

EDIT: I’ve edited the question to make it a little clearer that the bed files have a lot more than 5000 entries.

ADD REPLY • link 5.4 years ago by Ram 45k

0

Entering edit mode

I think the bug is with the help statement I wrote.

The --unique operation reports unique elements from sort-bed. This is not the same as sort -u (in spite of the documentation). This reports those elements from sorting which only appear once (which are unique).

The --duplicates operation reports duplicate elements from sort-bed.

Duplicate elements must appear more than once in the input, but they are only reported once.

I think it should be enough to do the following (assuming the bash shell is used) to get an answer consistent with sort -u:

$ cat <(sort-bed --unique file.bed) <(sort-bed --duplicates file.bed) | wc -l

If that isn't the case, I'd like to know that. Otherwise, I'll definitely need to fix the documentation in v2.4.38 binaries and on the readthedocs site. Or I may rename the options. But the functionality should be correct, insofar as the original intent of the options. I apologize for the documentation being misleading.

Testing:

$ cat test.bed
chr1    100 200
chr1    100 200
chr1    10  50
chr3    200 220
chr1    10  50

$ ./bin/sort-bed --unique test.bed
chr3    200 220

$ ./bin/sort-bed --duplicates test.bed
chr1    10  50
chr1    100 200

$ cat <(sort-bed --unique test.bed) <(sort-bed --duplicates test.bed) | wc -l
       3

ADD REPLY • link 5.4 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you, Alex. This explains why the latest version of sort-bed doesn’t match expectations. I’m still confused why sort -u file.bed and sort-bed file.bed | uniq give different results.

ADD REPLY • link 5.4 years ago by Ram 45k

0

Entering edit mode

I tried the sort-bed-megarow binary from the latest bundle, and it gives me the same number of rows as sort-bed does. My BED file has longer rows than I thought - line length ranges from 31 to 1804.

When I picked just the 4 fields (all numeric) that I need from the BED file, and then compared the output of sort-u to sort-bed | uniq, I see the same number of rows. I guess the extra columns (especially the alphanumeric column with ENS gene information) interfered with the logic or limited the memory usage somehow.

ADD REPLY • link 5.4 years ago by Ram 45k

score 0 · Answer 2 · 2020-03-09

I think sort alone on BED output will only sort on the first column, and so it will not usually ever give the same answer as sort-bed or even sortBed (except on trivial input, where each element is on a unique chromosome — maybe the reformatted output of fetchChromSizes but that's it?).

Piping that result to uniq would probably, therefore, give you a different answer, as identical BED rows would not be necessarily grouped and so could be counted more frequently.

There's a Biostars post here that demonstrates the effects of different applications of sort options on BED files, and the side effects from each and how they compare and differ with the output of other tools: How To Sort Bed Format File

A lot of uses of sort on BED files posted on the web simply sort on the first two columns, but I suggest in that post to instead sort on the first three fields, to correctly resolve the stop position. This emulates what sort-bed does, though not as quickly, and it makes the output order correct for the purposes of use in getting correct answers from set operations.