Hi,
I'm working on uniquifying a BED file. I tried the version of bedops installed in my HPC, and then since it did not have an option to uniquify on the fly, I downloaded the latest version of bedops from GitHub. The numbers I see are different and I don't know why that's happening.
Note: The line counts below are not completely genuine, as I masked the first four digits (which were identical in all wc outputs). The BED file is 7.5G in size, and smaller numbers are easier to compare by eye.
wc -l file.bed
xxxx4305
module load bedops #this loads bedops/2.4.2
sort-bed file.bed | uniq | wc -l
xxxx4305
sort -u file.bed | wc -l
xxxx3670
The above difference shows that sort-bed and sort work differently. The BED file is all numbers with no header, so I don't see why this difference should happen.
sort-bed from the current version (2.4.37) of bedops has a --unique option built in, so I installed that from GitHub:
/new/version/sort-bed --unique file.bed | wc -l
xxxx3035 #where did this number come from???
/new/version/sort-bed file.bed | uniq | wc -l
xxxx3670 #same as previous sort-bed | uniq combo
What could be happening here? Could the newer version be finding more duplicates because it uses memory more efficiently? Why is sort-bed | uniq not the same as sort -u?
I'll be getting raw output and running a diff to dig into what's happening here but in the meantime, I'd appreciate any pointers. Thank you!
Alex, the 4305 is a dummy value. There are 4 more digits that go before it, but they’re conserved across all wc operations, which is why I did not include them. The actual line count is >10,000 times the numbers shown here.
EDIT: I’ve edited the question to make it a little clearer that the BED files have a lot more than 5000 entries.
I think the bug is with the help statement I wrote.
The --unique operation reports unique elements from sort-bed. This is not the same as sort -u (in spite of the documentation). It reports those elements from sorting which appear only once (which are unique).
The --duplicates operation reports duplicate elements from sort-bed. Duplicate elements must appear more than once in the input, but they are only reported once.
I think it should be enough to do the following (assuming the bash shell is used) to get an answer consistent with sort -u:
If that isn't the case, I'd like to know that. Otherwise, I'll definitely need to fix the documentation in v2.4.38 binaries and on the readthedocs site. Or I may rename the options. But the functionality should be correct, insofar as the original intent of the options. I apologize for the documentation being misleading.
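The distinction described above (elements that appear only once, versus duplicated elements reported once) can be illustrated with coreutils alone. This is only a sketch of the semantics, not a sort-bed command: uniq -u and uniq -d stand in for --unique and --duplicates, and the toy input and variable names are made up for the example.

```shell
# Toy BED-like input (hypothetical): "chr1 10 20" occurs twice, the others once.
input=$(printf 'chr1\t10\t20\nchr1\t10\t20\nchr1\t30\t40\nchr2\t5\t9\n')

# Lines that occur exactly once (analogous to the --unique behaviour described above).
once=$(printf '%s\n' "$input" | sort | uniq -u)

# One copy of each line that occurs more than once (analogous to --duplicates).
dups=$(printf '%s\n' "$input" | sort | uniq -d)

# The union of the two sets should equal plain de-duplication.
combined=$(printf '%s\n%s\n' "$once" "$dups" | sort)
deduped=$(printf '%s\n' "$input" | sort -u)

if [ "$combined" = "$deduped" ]; then
    echo "union of once + dups matches sort -u"
fi
```

On this input the once set has two lines and the dups set has one, so neither set alone has the three lines that sort -u produces; only their union does.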
Testing:
Thank you, Alex. This explains why the latest version of sort-bed doesn’t match expectations. I’m still confused why sort -u file.bed and sort-bed file.bed | uniq give different results.
I tried the sort-bed-megarow binary from the latest bundle, and it gives me the same number of rows as sort-bed does. My BED file has longer rows than I thought - line length ranges from 31 to 1804.
When I picked just the 4 fields (all numeric) that I need from the BED file, and then compared the output of sort -u to sort-bed | uniq, I see the same number of rows. I guess the extra columns (especially the alphanumeric column with ENS gene information) interfered with the logic or limited the memory usage somehow.
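One possible mechanism, offered as a guess and not as a confirmed diagnosis of sort-bed 2.4.2: uniq only collapses adjacent identical lines, so a sort that orders rows by the coordinate columns but not by the remaining columns can leave identical rows separated. The sketch below imitates such a sort with GNU/BSD sort -s (stable, keyed on columns 1-3 only); the input rows and gene labels are made up.

```shell
tab=$(printf '\t')

# Three rows with identical coordinates; rows 1 and 3 are fully identical.
input=$(printf 'chr1\t10\t20\tgeneA\nchr1\t10\t20\tgeneB\nchr1\t10\t20\tgeneA\n')

# Stable sort on columns 1-3 only: ties keep input order, so the two geneA
# rows stay separated by the geneB row and uniq cannot collapse them.
keyed=$(printf '%s\n' "$input" \
    | sort -s -t "$tab" -k1,1 -k2,2n -k3,3n \
    | uniq | wc -l | tr -d ' ')

# Whole-line sort with de-duplication: identical rows are always merged.
whole=$(printf '%s\n' "$input" | sort -u | wc -l | tr -d ' ')

echo "coordinate-keyed sort | uniq: $keyed"
echo "sort -u:                      $whole"
```

Here the keyed pipeline keeps all 3 rows while sort -u keeps 2. If something similar is happening with the real file, piping the sort-bed output through a whole-line sort before uniq should bring the counts into agreement; comparing the two outputs with diff would confirm or rule this out.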