Question

bedops alternative for bedtools -unionbedg ?

0

Entering edit mode

8 months ago

Agastya ▴ 10

Hi, I have 200 bedgraphs that look like this:

chr1    256    257    4
chr1    257    261    1
chr1    261    264    2

Basically, the 4 columns are the chrom, start, stop, and depth in that order. Now, each bedgraph is 25-35GBs in size. I want to combine all the 200 bedgraphs into one. Using bedtools -unionbedg is taking too long (its been running for 4 days now). I have been suggested bedops as being faster. I cannot find any function that is analogous to unionbedg in bedops. Can anyone suggest a way to solve this problem?

bedtools bam coverage bedops bed • 1.6k views

ADD COMMENT • link updated 7 months ago by Alex Reynolds 35k • written 8 months ago by Agastya ▴ 10

0

Entering edit mode

Do you need to work in bedgraph space? The file size is likely smaller with bigwig files and you could use something like wiggletools.

ADD REPLY • link 8 months ago by Trivas ★ 1.7k

0

Entering edit mode

If you really have 200 files with that tiny intervals, I see why it takes that long. I think mapping should be faster, so bedtools map should be an alternative (mapping bigwigs as Trivas suggested is a very good idea!). To create the intervals to map to, you could use bedtools makewindows for uniform intervals or bedtools multiinter if you want to retain exactly the original precision.

ADD REPLY • link 8 months ago by Matthias Zepper 4.5k

0

Entering edit mode

Thanks for the bigWig suggestion. So, after I convert my .bed to .bw how do I use bedtools map to calculate coverage over a reference interval .bed file for all the 200 .bw? I can create the interval .bed file using makewindows. I need an output that looks like this:

chr1    256    257    0    4    7    4    1    ....    4
chr1    257    258    2    6    2    1    2    ....    3

Here, no. of columns = 203 (200 for the samples, and 3 for chr, start and stop). I am confused about which options to use in the bedtools map function.

ADD REPLY • link 8 months ago by Agastya ▴ 10

0

Entering edit mode

Sorry, I did not make myself clear: If you have bedGraphs, then you could use bedtools map, with bigwigs rather bigWigAverageOverBed (also available from Bioconda) or wiggletools.

ADD REPLY • link 7 months ago by Matthias Zepper 4.5k

score 3 · Accepted Answer · 2023-08-31

3

Entering edit mode

8 months ago

Alex Reynolds 35k

Ah, a BEDOPS question. Cool. Here's one way to do it, perhaps there are others.

Let's say you have X sort-bed sorted BED3+1 ("bedgraph") files named like so:

A.bg B.bg C.bg ... X.bg

You can map them with the following two lines of code:

$ bedops --merge A.bg B.bg ... N.bg | bedops --chop 1 - > chop.bed
$ paste chop.bed <(bedmap --echo-map-id --unmapped-val 0 chop.bed A.bg) <(bedmap --echo-map-id --unmapped-val 0 chop.bed B.bg) ... <(bedmap --echo-map-id --unmapped-val 0 chop.bed X.bg) > answer.bed

That ... bit in the middle of the paste command is something a simple script would use a loop to create, for however many files you're working with. I'll leave that as an exercise for the reader.

(Note: The limit of X is determined by the Unix system you're using. Without making adjustments to the operating system's kernel settings, this means that generally speaking, in practice, you can run BEDOPS tools on up to 1021 files simultaneously.)

Multibase elements in all your files will be split into single-base elements — whatever is in chop.bed.

If a position or range in a file does not map to single base position, the --unmapped-val 0 option sets that cell's value to zero. This can be a string, too, so nan or NaN or whatever else you'd prefer.

The file answer.bed will be a BED3+N file that has single-base elements from chop.bed. The remaining N columns will be zeroes or whatever maps to the bedgraph at the specified position.

The order in which you specify each <(bedmap...) process will determine the columnar order of results. In other words, if you use A, B, ... X, then the columns will reflect that order.

Because the inputs are sorted, it is all processed in small chunks of memory. It will still take time to run, but I'll wager this will run faster than any other hand-wavy command-line option you're given here that is working with text-based data.

I do recommend first trying this out on a handful of files, so that you understand how this works.

My guess is that putting your data into a database or binary format and storing it in all in memory is better. For the number and size of files you are working with, perhaps that's not a realistic option. You might look at bigWig files but I don't know of any way to do this kind of tabular analysis with a lot of them — not to say it can't be done (maybe look at pyBigWig and Pandas if you have a huge amount of memory), but bigWigs were designed for visualization.

Another option for speeding things up is to use bedextract to efficiently split all your BED (or BED3+1) files by chromosome, and then throw 24 or so jobs on your local computational cluster, each doing the map step on its own chromosome. That will typically reduce your runtime down to however long it takes chr1 to complete, which is the longest chromosome and so usually the largest input.

Good luck!

ADD COMMENT • link 7 months ago by Alex Reynolds 35k

0

Entering edit mode

So, if my chop.bed contains 1bp windows over chr1, would using bedextract give any performance boost? I am thinking, if the chop file contains only chr1, then the program would only compute till chr1 is over, and I don't need bedextract.. This way, I wouldnt have to create more bg files (for every chr for every sample).

ADD REPLY • link 8 months ago by Agastya ▴ 10

0

Entering edit mode

The only reason I suggest to use bedextract in this case would be to easily parallelize whole-genome matricization (chr1, chr2, etc.). If your inputs are all on one chromosome (chr1, say) then there is no need to use bedextract.

You could parallelize summarization though, by working on subranges within a chromosome. This could work well if your data storage allows parallel read/write operations.

In this case, you could split chop.bed by 1Mb increments (or whatever size makes sense for your input) and then apply a map step to each 1Mb increment in its own job.

Once all the jobs are complete, you could concatenate the incremental results into a final BED3+N file, perhaps compressed with starch or bgzip.

ADD REPLY • link 8 months ago by Alex Reynolds 35k

0

Entering edit mode

Hi, so this mostly works on my files. My bedgraphs upon using bedextract --list-chr show only chr1,2,3,X and Y. But when I use grep, I can find chr22 and other chromosomes. I think, due to this issue bedmap --echo-map-id gives me zeroes as output when I try to map my bedgraphs to chromosomes (1bp chopped bed files) other than 1,2,3,X and Y, because it cant find chr22 (or others) in my bedgraphs. Do you know why this might happen or how I can go about solving this? I am able to generate cumulative coverage reports for chr1,2,3 using the method you suggested. Unable to do it for the rest..

ADD REPLY • link 7 months ago by Agastya ▴ 10

0

Entering edit mode

My guess is that data are not sorted or are in unknown sort order that does not work with BEDOPS tools. You can pipe data to sort-bed to fix this; feel free to post an example command and I'll show how to sort correctly, based on what you're working with.

ADD REPLY • link 7 months ago by Alex Reynolds 35k

1

Entering edit mode

Yes, you were right. I sorted the bed files, using sort-bed and then I ran this command on the sorted bed files $ paste chop.bed <(bedmap --echo-map-id --unmapped-val 0 chop.bed A.bg) <(bedmap --echo-map-id --unmapped-val 0 chop.bed B.bg) ... <(bedmap --echo-map-id --unmapped-val 0 chop.bed N.bg) > answer.bed It worked. It is perplexing. I thought that bedtools genomecov usually outputs sorted bed files (which I used to create the bedgraphs). Thanks for the help!

ADD REPLY • link 7 months ago by Agastya ▴ 10

0

Entering edit mode

Glad to hear it worked. Other kits will work with data in whatever order you provide, which is convenient, but the tradeoff is that they run more slowly because they have to read everything into memory. That also requires having sufficient memory, in addition to the speed hit.

This kit will work fast on presorted input; our use of sorting allows using less memory for operations. If you use the bedops kit for operations, you can pipe output from bedops to bedmap and back again, etc. because the sort order is preserved between tools.

ADD REPLY • link 7 months ago by Alex Reynolds 35k

0

Entering edit mode

Examples:

$ sort-bed foo.bg > foo.sorted.bg
$ sort-bed bar.bed > bar.sorted.bed
$ sort-bed <(gunzip -c baz.bed.gz) > baz.sorted.bed
$ gunzip -c baz.bed.gz | sort-bed - | gzip -c > baz.sorted.bed.gz

This will sort data with arbitrary numbers of columns, up to some very long line length that would not typically apply to bedGraph files.

You can integrate unsorted data with bedops and bedmap via piping, e.g.:

$ sort-bed foo.bg | bedops --some-operation - > output.bed
$ sort-bed foo.bg | bedops --some-operation - | bedmap --other-operations - some-map-data.bed > output.bed

The - hyphen can be used for a placeholder to process data from some upstream Unix process (including data coming out of sort-bed or other tools).

ADD REPLY • link 7 months ago by Alex Reynolds 35k