Ah, a BEDOPS question. Cool. Here's one way to do it, perhaps there are others.
Let's say you have X sort-bed-sorted BED3+1 ("bedgraph") files, named like so:
A.bg B.bg C.bg ... X.bg
You can map them with the following two lines of code:
$ bedops --merge A.bg B.bg ... X.bg | bedops --chop 1 - > chop.bed
$ paste chop.bed <(bedmap --echo-map-id --unmapped-val 0 chop.bed A.bg) <(bedmap --echo-map-id --unmapped-val 0 chop.bed B.bg) ... <(bedmap --echo-map-id --unmapped-val 0 chop.bed X.bg) > answer.bed
The repeated <(bedmap ...) bit in the middle of the
paste command is something a simple script could generate with a loop, for however many files you're working with. I'll leave that as an exercise for the reader.
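As a sketch of that exercise, here is one way a loop could assemble both commands as strings and print them (the file list here is a stand-in; swap in your own glob or array):

```shell
#!/usr/bin/env bash
# Build the merge/chop and paste commands for an arbitrary list of
# bedgraph files. This only prints the commands; run them once you
# have verified they look right and BEDOPS is on your PATH.
files=(A.bg B.bg C.bg)   # e.g. files=( *.bg )

# Step 1: union of all intervals, chopped into single-base elements.
echo "bedops --merge ${files[*]} | bedops --chop 1 - > chop.bed"

# Step 2: paste chop.bed alongside one bedmap process substitution per file.
cmd="paste chop.bed"
for f in "${files[@]}"; do
  cmd+=" <(bedmap --echo-map-id --unmapped-val 0 chop.bed ${f})"
done
echo "${cmd} > answer.bed"
```

Piping the printed second command to bash (or eval'ing it) would run the actual map step.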
(Note: The limit on X is determined by the per-process open-file limit of the Unix system you're using. Without adjusting the operating system's kernel or shell settings, this means that, generally speaking, you can run BEDOPS tools on up to 1021 files simultaneously — 1024 file descriptors, minus stdin, stdout, and stderr.)
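You can check (and, within the hard limit, raise) that open-file limit for the current shell before launching the big paste command:

```shell
# Inspect the per-process open-file limits for this shell session.
ulimit -n        # soft limit, e.g. 1024 on many Linux systems
ulimit -H -n     # hard limit

# Raise the soft limit for this session, if the hard limit allows:
# ulimit -n 4096
```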
Multibase elements in all your files will be split into the single-base elements found in
chop.bed. If a file has no element overlapping a given single-base position, the
--unmapped-val 0 option sets that cell's value to zero. The unmapped value can be a string, too, so you could use
NaN or whatever else you'd prefer.
answer.bed will be a BED3+X file containing the single-base elements from
chop.bed. The remaining X columns will hold either zeroes or whatever value each bedgraph maps to at that position.
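To make the layout concrete, here is a tiny hand-made illustration (the coordinates and values are invented; the two value files stand in for what the bedmap process substitutions would emit for two bedgraphs):

```shell
# Three single-base rows, as chop.bed would contain them.
printf 'chr1\t10\t11\nchr1\t11\t12\nchr1\t12\t13\n' > chop.bed

# One value per chop.bed row, per input file; 0 marks unmapped positions.
printf '0.5\n0\n1.2\n' > A.vals   # values that mapped from A.bg
printf '0\n2.0\n0\n'   > B.vals   # values that mapped from B.bg

# Same structure as the real paste command: BED3 plus one column per file.
paste chop.bed A.vals B.vals > answer.bed
cat answer.bed
```

Each row of answer.bed is then one base position followed by one value column per input file, in the order the files were listed.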
The order in which you specify each
<(bedmap...) process substitution determines the column order of the results. In other words, if you list the files A through
X, then the columns will reflect that order.
Because the inputs are sorted, everything is processed in small chunks of memory. It will still take time to run, but I'll wager it will run faster than any other hand-wavy command-line option offered here that works with text-based data.
I do recommend first trying this out on a handful of files, so that you understand how this works.
My guess is that putting your data into a database or a binary format and keeping it all in memory would be better. For the number and size of files you are working with, though, perhaps that's not a realistic option. You might look at bigWig files, but I don't know of any way to do this kind of tabular analysis with a lot of them. That's not to say it can't be done (maybe look at
pyBigWig and Pandas if you have a huge amount of memory), but bigWigs were designed for visualization.
Another option for speeding things up is to use
bedextract to efficiently split all your BED (or BED3+1) files by chromosome, and then submit 24 or so jobs to your local computational cluster, each running the map step on its own chromosome. That typically reduces the overall runtime to however long
chr1 takes to complete, since it's the longest chromosome and usually the largest input.
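A minimal sketch of that fan-out, assuming a hypothetical per-chromosome wrapper script and directory layout (map_one_chromosome.sh and by_chr/ are stand-ins; swap the echo for your scheduler's submit command, e.g. qsub or sbatch):

```shell
#!/usr/bin/env bash
# Emit one map job per human chromosome. The per-chromosome inputs
# would be made beforehand with bedextract, e.g.:
#   bedextract chr1 A.bg > by_chr/chr1/A.bg
for chr in chr{1..22} chrX chrY; do
  echo "map_one_chromosome.sh by_chr/${chr}"
done > jobs.txt
cat jobs.txt
```

Each line of jobs.txt is then one independent job; the merge/chop/paste pipeline above runs unchanged inside each per-chromosome directory.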