By default, bedtools sort orders records by chromosome and then by start coordinate, ignoring the end coordinate. As a result, you can end up with something like this:
chr1 4 15
chr1 4 17
chr1 4 15
uniq reads the file from top to bottom and compares each line only with the one directly before it. It therefore detects duplicated reads only if they sit on adjacent lines, which is not the case in the example above.
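To see this for yourself, here is a minimal sketch that recreates the three-line example and counts the surviving lines. The temporary file and the variable names are only for illustration, and the intervals are written tab-separated as in a real BED file.

```shell
# Recreate the three intervals from the example, in the same order.
bed=$(mktemp)
printf 'chr1\t4\t15\nchr1\t4\t17\nchr1\t4\t15\n' > "$bed"

# uniq compares neighbouring lines only, so both "chr1 4 15" records survive.
uniq_only=$(uniq "$bed" | grep -c '')

# Sorting on chr, start and end first makes the duplicates adjacent.
sorted_then_uniq=$(sort -k1,1 -k2,2n -k3,3n "$bed" | uniq | grep -c '')

echo "uniq alone: $uniq_only lines; sort then uniq: $sorted_then_uniq lines"
rm -f "$bed"
```

On this input, uniq alone keeps all 3 lines, while sorting on all three coordinates first leaves 2.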
I recommend not using bedtools for sorting, as indicated on the bedtools sort manual page: Unix sort is faster and more memory-efficient. If you want to combine sorting of your BED file with deduplication, you can use the following command:
sort -k1,1 -k2,2n -k3,3n -k6,6 -u input.bed > output.bed
This command takes chr, start, end and strand into account, which together are the essential information to describe a unique fragment. The -u option then acts only on the key columns given in the command: of all lines that agree on these four columns, a single one is kept, even if they differ in other columns such as name or score.
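As a rough illustration of what the -u does with these keys, the sketch below runs the same sort on a made-up six-column BED. The read names and the temporary files are hypothetical, and the LC_ALL=C prefix is my addition so that the strand characters compare bytewise regardless of your locale.

```shell
# Made-up BED6 records: readA and readC share chr, start, end and strand.
in_bed=$(mktemp)
out_bed=$(mktemp)
printf 'chr1\t4\t15\treadA\t0\t+\n'  > "$in_bed"
printf 'chr1\t4\t17\treadB\t0\t+\n' >> "$in_bed"
printf 'chr1\t4\t15\treadC\t0\t+\n' >> "$in_bed"
printf 'chr1\t4\t15\treadD\t0\t-\n' >> "$in_bed"

# Same keys as in the command above; -u keeps one line per key combination.
LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k6,6 -u "$in_bed" > "$out_bed"

# Count the output records and the distinct (chr, start, end, strand) tuples.
kept=$(grep -c '' "$out_bed")
distinct_keys=$(cut -f1,2,3,6 "$out_bed" | LC_ALL=C sort -u | grep -c '')

cat "$out_bed"
rm -f "$in_bed" "$out_bed"
```

Three records remain: readA and readC collapse into one because they differ only in the name column, while readD is kept because it lies on the opposite strand.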
If you are on a Mac, you may want to get the GNU core utilities, e.g. via Homebrew (which installs GNU sort as gsort), as GNU sort provides a nice --parallel option for multi-threading.
2.0 years ago by ATpoint