Question: Bedtools sortBed | uniq and bash sort | uniq returns different number of lines
6
gravatar for QVINTVS_FABIVS_MAXIMVS
15 months ago by
USA SoCal
QVINTVS_FABIVS_MAXIMVS2.0k wrote:

Why do I get different number of lines when I use bash sort vs Bedtools sortBed

$ cut -f 1-4 BED.bed | wc -l

295658

-

$ cut -f 1-4 BED.bed | sort -k1,1 -k2,2n  | uniq | wc -l

236207

_

$ cut -f 1-4 BED.bed | sort | uniq | wc -l

236207

_

$ cut -f 1-4 BED.bed | sortBed | uniq | wc -l 

289619

-

$ cut -f 1-4 BED.bed | head -n 5

 chr1   703951  761450  DUP
 chr1   839849  839969  DEL
 chr1   839858  839956  DEL
 chr1   839859  839919  DEL
 chr1   839872  839933  DEL
bash uniq sort bed bedtools • 895 views
ADD COMMENTlink modified 15 months ago by ATPoint1.1k • written 15 months ago by QVINTVS_FABIVS_MAXIMVS2.0k
2

Have you tried to diff the results by capturing them to files to see where the differences are?

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax33k

I think you're onto something. I'm not familiar with diff but a quick read over the command I think sortBed is not sorting the end position as accurately?

first file '<' is sortBed

second file '>' is sort -k1,1 -k2,2n

289422c236017
< chrX  154871701   154872150   DEL
---
> chrX  154871701   154871850   DEL
289423a236019,236021
> chrX  154871701   154871950   DEL
> chrX  154871701   154872150   DEL
> chrX  154871751   154871850   DEL
ADD REPLYlink modified 15 months ago • written 15 months ago by QVINTVS_FABIVS_MAXIMVS2.0k

Looks like you are not sorting the third column in your unix sort. Is that always unique?

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax33k
1

What answer does sort-bed return?

ADD REPLYlink written 15 months ago by Alex Reynolds20k
$ cut -f 1-4 BED.bed | sort-bed - | uniq | wc -l

 236243
ADD REPLYlink written 15 months ago by QVINTVS_FABIVS_MAXIMVS2.0k
7
gravatar for ATPoint
15 months ago by
ATPoint1.1k
ATPoint1.1k wrote:

By default, bedtools sorts by chr and start, ignoring the end-coordinate. Therefore, it might happen that you get something like this:

chr1 4 15

chr1 4 17

chr1 4 15

Uniq iterates from the top to the bottom of this file. Therefore, it recognizes duplicated reads only, if they appear right one below the other, which is not the case in the above example.

I recommend you not to use bedtools for sorting, as indicated on the bedtools::sort manual page. Unix sort is faster and more memory efficient. If you want to combine sorting of your bed with deduplication, you can use the following command:

sort -k1,1 -k2,2n -k3,3n -k6,6 -u input.bed > output.bed

This command takes chr, start, end and strand into account, which are the essential information to describe a unique fragment. The -u then acts on all the columns, that were provided in the command, generating a deduplicated file.

If you are on a Mac, you may want to get the GNU core utilities e.g. via homebrew, as GNU sort provides a nice --parallel option for multi-threading.

ADD COMMENTlink modified 15 months ago • written 15 months ago by ATPoint1.1k

Thank you. After doing diff I kinda assumed your answer. I'll stick to UNIX sorting now. Just curious on why bedSort returned different values.

ADD REPLYlink written 15 months ago by QVINTVS_FABIVS_MAXIMVS2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 688 users visited in the last hour