Combining multiple bedfiles
Entering edit mode
3.0 years ago

I have many (>100) bed files in this format


chr1    1   2    2
chr1    10  11    3
chr1    50  51   4


chr1    1   2   10
chr1    10  11   8
chr10   2   3   8


chr1    1   2   1
chr1    50  51   2
chr10   2   3   9

where the files have some sites in common, but not all of them.

My goal is to obtain a file that looks like this


chrm start end file1 file2 fileN
chr1    1   2   2    10    1
chr1    10  11  3    8      NA
chr1    50  51  4    NA   2   
chr10   2   3   NA  2      9

I want to be able to do this efficiently for all 100 files that I have. I was thinking about using bedtools but don't want to use many many lines of intersections. The files are quite large as well. Does anyone have a suggestion on how to do this? Thanks!

bed bedtools • 1.7k views
Entering edit mode
3.0 years ago

BEDOPS bedops and bedmap are efficient in memory and time on whole-genome scale files. These tools also work with standard input and output streams, which allows chaining of operations and integration of set operations with standard Unix tools. These features make work on hundreds of files tractable.

Method 1

If you can get away without the NA, given sorted single-base BED files A.bed, B.bed through N.bed (N=100 or whatever), you could take their union and their merge (these are all stored under the directory files):

$ bedops -u files/A.bed files/B.bed ... files/N.bed > union.bed
$ bedops -m files/A.bed files/B.bed ... files/N.bed > merge.bed

Or if you want to work on all the files in this directory, more simply:

$ bedops -u files/*.bed > union.bed
$ bedops -m files/*.bed > merge.bed

Then map the IDs in the union file to the intervals in the merge file:

$ bedmap --echo --echo-map-id --delim '\t' merge.bed union.bed > answer.bed

The file answer.bed would look like:

chr1    1   2   1;10;2
chr1    10  11  3;8
chr1    50  51  2;4
chr10   2   3   8;9

The --multidelim '\t' option could be added, but using the default semi-colon delimiter may make it easier to demonstrate the result. In this approach, there are no NA values and ID values are presented in lexicographical order, not the order of original input files.

Method 2

If you need an NA value positioned in order from columns 4 through N+3, where ID values are missing, then one option is to loop over all N BED files to generate N per-file maps:

$ for fn in `ls files/*.bed`; do echo $fn; bedops -n 1 merge.bed $fn | awk '{ print $0"\tNA" }' | bedops -u - $fn | cut -f4 > $; done

The pipeline of bedops commands here uses merge.bed to fill in gaps where intervals are missing, adding NA as the ID value for missing intervals before doing the final mapping step with cut. At the end, each of the per-file maps is a column in the result you ultimately want.

Finally, use paste to glue all these columns together into an N+3 column matrix:

$ paste merge.bed files/* > answer.mtx

The file answer.mtx will contain the intervals in the first three columns, and ID values from columns 4 through N+3, for input files A.bed through N.bed. It'll look like this:

chr1    1   2   2   10  1
chr1    10  11  3   8   NA
chr1    50  51  4   NA  2
chr10   2   3   NA  8   9

Basically, this is the result you want, but without the chrm ... fileN header at the top.

Unix pipelines demo'ed in this answer will make all of this work go faster, by avoiding generating intermediate files as much as possible.

If you have access to a compute cluster and each file is very large, or you have many hundreds of files, and you need to automate this process, each iteration or step in the serial per-map loop can be parallelized as a separate job on the cluster. Then you have a final, "dependent" job that does the paste step at the end, in map-reduce fashion.

Entering edit mode

I'm trying to wrap my head around the difference between the Union (-u) and Merge (-m). But even looking at the docs it seems both do the same in practical terms. Why do you need both?

And also, how do you keep the file names at the top of the columns, as the OP asked?

Thanks a lot in advance

Entering edit mode

The documentation includes images to visually explain the difference between different operations:

• • •

Basically, the difference is that -u puts input elements into the output set, without changing them. The -m operation creates new elements by merging the input intervals.

Entering edit mode

Thanks so much! Do you see a straightforward way to keep the file names as feature names? (i.e. column names for columns 4+). Asking this for the case with many input files. For now, I'm processing everything and at the end I'm adding a new line with file names, but still need to manually make sure the columns match

Entering edit mode

I don't know. It will depend on your inputs and I don't really understand what you are doing. Maybe read through the docs I linked to, in order to see which operators best meet your needs, and then explore?

Entering edit mode
3.0 years ago


you could use a slightly modified version of my python script in this answer:

import glob
import sys
import pandas

def read_filenames(names):
    for arg in names:
        if "*" in arg:
            for file in glob.glob(arg):
                yield file
            yield arg

data = []

for sample in read_filenames(sys.argv[1:]):
    frame = pandas.read_csv(
        names=["chr", "start", "end", sample],
        index_col=[0, 1, 2]

data_joined = data[0].join(data[1:], how='outer')


Run it like this:

$ python File* > result.csv

fin swimmer


Login before adding your answer.

Traffic: 2513 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6