Question: [solved]bedtools coverageBed: not sorted output
0
gravatar for Chico Fernandez
4.7 years ago by
Chico Fernandez70 wrote:

SOLVED 

I need to calculate coverages for each position in several regions. I use following command:
    coverageBed -abam bam_file -b name_of_bed -d

But finally the coordinates from .bed file are not sorted, in the result file! It reduces the opportunity to use cost-efficient algorithms.

Could you tell me what I am doing wrong? Is there a better way to calculate coverages for each position and keep them in the same order as in original bed file?

Example:

BED-file starts with: chr1    762097    762270

                                 chr1    865591    865791

                                 (and so on, ascending order of coordinates in chromosome)

so I would like to have coverage output to be in the same order. What I have:

first line in the output file: 1       39845644        39846247        1       0

100th line:                        1       917230  917836  516     11

(weird order, neither ascending nor descending)

I re-sorted bam file just in case it was not sorted. It did not help.

UPDATE:

tried to calculate coverage using bedTools v. 2.25 and -sorted flag. Previously output was:

1    39845744    39846147    1    4

1    39845744    39846147    2    4

1    39845744    39846147    3    4

Now:

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    1    0

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    2    0

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    3    0

WTF is this? HWUSI? Code for v.2.15: 

coverageBed -abam Sample.realigned.dm.recalibrated.bam -b Regions.bed -d > output.txt

Code for v.2.25: 

covBed -abam Sample.realigned.dm.recalibrated.bam -b Regions.bed -d -sorted > output.sorted.txt

coverage next-gen • 2.3k views
ADD COMMENTlink modified 4.0 years ago by aiirtime0 • written 4.7 years ago by Chico Fernandez70

How was your bed file sorted? Couldn't you just use the sort command to sort them how you'd like? Or if you'd rather re-sort them to match your original bed file you could just read both into R and use the match function (merge should work too).

ADD REPLYlink written 4.7 years ago by dally190

They are sorted lexicographically. The smaller position on chromosome is - the higher the region is.

ADD REPLYlink written 4.7 years ago by Chico Fernandez70

So to get this right, your bam and bed input is sorted, but your output doesn't come out sorted (because bedtools automatically sorts in its preferred sort order) as you'd like? Could you post a small section of the resulting bedfile and the input bed file and an example of what you want the output to be so we can help.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by dally190

I updated my question. And I have one bam file and one bed file. So I would like to obtain number of reads from bam file covering probes from bed file and to have them sorted as in the bed file.

ADD REPLYlink written 4.7 years ago by Chico Fernandez70

Try @dariober's solution and if that does not work i'll try to give you a R solution.

ADD REPLYlink written 4.7 years ago by dally190

I'm trying now. The files are approx 5 GB each. I can sort output files with a script, but I a) do not want to re-write the file with coverages (more memory + time consumption) b) do not understand how and why bedtools mixes the order of bed specified regions. It makes no sense.

ADD REPLYlink written 4.7 years ago by Chico Fernandez70

Your bed file is 5GB!? In this case you definitely want to use the -sorted option that comes with latest versions of bedtools otherwise it's going to take an awful amount of memory and time (and you get sorted output). By the way, if you stream the output of coverageBed to sort I think the overhead of re-sorting is going to be relatively small compared to the actual counting.

ADD REPLYlink written 4.7 years ago by dariober11k

Sorry, had a limit of 5 messages per day on this forum so could not answer yesterday. Yes, they are exome-seq data. I tried your gnu version - it worked for ages, I waited several hours and then interrupted it. Now will try 2.25 version, hope it will work.

ADD REPLYlink written 4.7 years ago by Chico Fernandez70
3
gravatar for dariober
4.7 years ago by
dariober11k
WCIP | Glasgow | UK
dariober11k wrote:

What version of bedtools do you use? As of version 2.24 the output should be sorted if input is sorted. From https://github.com/arq5x/bedtools2/blob/master/docs/content/history.rst#version-2250-3-sept-2015:

The coverage tool now takes advantage of pre-sorted intervals via the -sorted option. This allows the coverage tool to be much faster, use far less memory, and report coverage for intervals in their original order in the input file.

Alternatively, just use gnu sort as

coverageBed -abam bam_file -b name_of_bed -d | sort -k1,1 -k2,2n > out.bed
ADD COMMENTlink modified 9 months ago by RamRS30k • written 4.7 years ago by dariober11k
0
gravatar for h.mon
4.7 years ago by
h.mon31k
Brazil
h.mon31k wrote:

Sort the .bam and .bed files with samtools sort and sortBed, respectively.

ADD COMMENTlink written 4.7 years ago by h.mon31k

Thank you. But both of them are sorted.

ADD REPLYlink written 4.7 years ago by Chico Fernandez70
0
gravatar for aiirtime
4.0 years ago by
aiirtime0
Germany
aiirtime0 wrote:

Hello, it could be I miss understand something, but where is the Problem solved? I have a similar issue it doesn't depend on the order but on the strange output. I recognized that the coverageBed usage/description changed at the point of the perspective.

Earlier versions "coverageBed" main task description was:

Returns the depth and breadth of coverage of features from A on the intervals in B

which is represented by your first output Chico Fernandez:

1    39845744    39846147    1    4
1    39845744    39846147    2    4
1    39845744    39846147    3    4

and at the version 2.26 "coverageBed" main task description is:

Returns the depth and breadth of coverage of features from B on the intervals in A

you see the perspective is changed for more speed I thing, but the Problem is your output also changed like the second output from Chico Fernandez:

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    1    0

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    2    0

1    10001    10077    HWUSI-EAS1693_0012:7:41:10492:2084#0/1    0    +    10001    10077    0,0,0    1    76,    0,    3    0

I thought ok, than I have to change only the parameters -a and -b so: I try to force coverageBed from version 2.26 to do the old fashion by changing -a and -b This didn't worked because the tool is crushing. It was the RAM (The bam file with 10Gb and as -b parameter couldn't be load into the RAM with 64Gb, internal stuff makes the file bigger? could it be?)

Is there a possibility to use coverageBed like before? Instead of using the old version?

Best airtime

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by aiirtime0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1665 users visited in the last hour