Question: Combining multiple bedfiles
christacaggiano (UCSF) wrote, 2.2 years ago:

I have many (>100) bed files in this format

file1

chr1    1   2    2
chr1    10  11    3
chr1    50  51   4

file2

chr1    1   2   10
chr1    10  11   8
chr10   2   3   8

fileN

chr1    1   2   1
chr1    50  51   2
chr10   2   3   9

where the files have some sites in common, but not all of them.

My goal is to obtain a file that looks like this

output

chrm start end file1 file2 fileN
chr1    1   2   2    10    1
chr1    10  11  3    8      NA
chr1    50  51  4    NA   2   
chr10   2   3   NA  8    9

I want to be able to do this efficiently for all 100 files that I have. I was thinking about using bedtools but don't want to use many many lines of intersections. The files are quite large as well. Does anyone have a suggestion on how to do this? Thanks!

bed bedtools
Alex Reynolds (Seattle, WA USA) wrote, 2.2 years ago:

BEDOPS bedops and bedmap are memory- and time-efficient on whole-genome-scale files. Both tools read from and write to standard streams, so operations can be chained together and combined with standard Unix tools. These features make working with hundreds of files tractable.

Method 1

If you can get away without the NA values, then given sorted, single-base BED files A.bed, B.bed through N.bed (N = 100, or however many you have), you could take their union and their merge (assuming they are all stored under the directory files):

$ bedops -u files/A.bed files/B.bed ... files/N.bed > union.bed
$ bedops -m files/A.bed files/B.bed ... files/N.bed > merge.bed

Or if you want to work on all the files in this directory, more simply:

$ bedops -u files/*.bed > union.bed
$ bedops -m files/*.bed > merge.bed

Then map the IDs in the union file to the intervals in the merge file:

$ bedmap --echo --echo-map-id --delim '\t' merge.bed union.bed > answer.bed

The file answer.bed would look like:

chr1    1   2   1;10;2
chr1    10  11  3;8
chr1    50  51  2;4
chr10   2   3   8;9

The --multidelim '\t' option could be added, but using the default semi-colon delimiter may make it easier to demonstrate the result. In this approach, there are no NA values and ID values are presented in lexicographical order, not the order of original input files.

Method 2

If you need an NA value positioned in order from columns 4 through N+3, where ID values are missing, then one option is to loop over all N BED files to generate N per-file maps:

$ for fn in files/*.bed; do echo "$fn"; bedops -n 1 merge.bed "$fn" | awk '{ print $0"\tNA" }' | bedops -u - "$fn" | cut -f4 > "$fn.map"; done

The pipeline of bedops commands here uses merge.bed to fill in gaps where intervals are missing: bedops -n 1 finds the intervals in merge.bed that are absent from the input file, awk tags each of them with an NA ID, bedops -u folds them back in with the file's real intervals, and cut -f4 keeps just the ID column. At the end, each per-file map is one column of the result you ultimately want.
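If BEDOPS is not installed, the same per-file map can be sketched with plain awk (a conceptual equivalent, not part of the original recipe): store one file's ID per interval, then walk merge.bed, printing the stored ID where the interval exists and NA where it doesn't. The demo inputs below mirror the question's file2:

```shell
# Demo fixtures: the merged interval list and one input file (file2).
printf 'chr1\t1\t2\nchr1\t10\t11\nchr1\t50\t51\nchr10\t2\t3\n' > merge.bed
printf 'chr1\t1\t2\t10\nchr1\t10\t11\t8\nchr10\t2\t3\t8\n' > file2.bed

# First pass (NR==FNR) stores file2's ID keyed by chrom:start:end;
# second pass walks merge.bed and prints the stored ID, or NA if absent.
awk 'NR==FNR { v[$1":"$2":"$3]=$4; next }
     { k=$1":"$2":"$3; print (k in v ? v[k] : "NA") }' file2.bed merge.bed > file2.bed.map
```

Here file2.bed.map contains 10, 8, NA, 8 — one value per merge.bed interval, ready to paste as a column.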

Finally, use paste to glue all these columns together into an N+3 column matrix:

$ paste merge.bed files/*.bed.map > answer.mtx

The file answer.mtx will contain the intervals in the first three columns, and ID values from columns 4 through N+3, for input files A.bed through N.bed. It'll look like this:

chr1    1   2   2   10  1
chr1    10  11  3   8   NA
chr1    50  51  4   NA  2
chr10   2   3   NA  8   9

Basically, this is the result you want, but without the chrm ... fileN header at the top.
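A header line can be prepended afterwards by deriving the column names from the map file names (a hedged sketch, not from the original recipe; it assumes the per-file maps are named files/<name>.bed.map as above — the fixtures below stand in for the real Method 2 outputs):

```shell
# Demo fixtures standing in for the real per-file maps and pasted matrix:
mkdir -p files
: > files/file1.bed.map
: > files/file2.bed.map
printf 'chr1\t1\t2\t2\t10\n' > answer.mtx

# Build the header from the map file names, then prepend it to the matrix.
printf 'chrm\tstart\tend' > header.txt
for fn in files/*.bed.map; do
    printf '\t%s' "$(basename "$fn" .bed.map)" >> header.txt
done
printf '\n' >> header.txt
cat header.txt answer.mtx > answer.with_header.mtx
```

Because the loop iterates over the same glob used by the paste step, the header columns line up with the data columns without manual bookkeeping.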

The Unix pipelines demonstrated in this answer speed all of this up by avoiding intermediate files as much as possible.

If you have access to a compute cluster and each file is very large, or you have many hundreds of files and need to automate this process, each iteration of the serial per-file map loop can be submitted as a separate job on the cluster, with a final, dependent job that runs the paste step at the end, in map-reduce fashion.


I'm trying to wrap my head around the difference between union (-u) and merge (-m), but even after looking at the docs, they seem to do the same thing in practical terms. Why are both needed?

And also, how do you keep the file names at the top of the columns, as the OP asked?

Thanks a lot in advance

written 14 months ago by fr

The documentation includes images to visually explain the difference between different operations:

• https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html
• https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#everything-u-everything
• https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#merge-m-merge

Basically, the difference is that -u puts the input elements into the output set unchanged, while -m creates new elements by merging overlapping or adjacent input intervals.
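The semantics can be roughly illustrated with plain sort and awk (an emulation of the behavior, not bedops itself): -u behaves like a sorted concatenation that leaves each element intact, while -m additionally collapses overlapping or adjacent elements into new, merged ones:

```shell
# Two small inputs with one overlapping pair of intervals.
printf 'chr1\t1\t5\nchr1\t20\t30\n' > a.bed
printf 'chr1\t3\t10\n' > b.bed

# Roughly what `bedops -u a.bed b.bed` returns: all elements, sorted, unchanged.
sort -k1,1 -k2,2n a.bed b.bed > union.demo.bed

# Roughly what `bedops -m a.bed b.bed` returns: overlapping/adjacent
# elements collapsed into single merged intervals.
sort -k1,1 -k2,2n a.bed b.bed \
  | awk 'NR==1 {c=$1; s=$2; e=$3; next}
         $1==c && $2<=e {if ($3>e) e=$3; next}
         {print c"\t"s"\t"e; c=$1; s=$2; e=$3}
         END {print c"\t"s"\t"e}' > merge.demo.bed
```

Here union.demo.bed keeps all three input elements, while merge.demo.bed contains only chr1 1 10 and chr1 20 30.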

written 14 months ago by Alex Reynolds

Thanks so much! Do you see a straightforward way to keep the file names as feature names (i.e., column names for columns 4+)? I'm asking for the case with many input files. For now I'm processing everything and appending a line with the file names at the end, but I still have to manually make sure the columns match.

written 14 months ago by fr

I don't know. It will depend on your inputs and I don't really understand what you are doing. Maybe read through the docs I linked to, in order to see which operators best meet your needs, and then explore?

written 14 months ago by Alex Reynolds
finswimmer (Germany) wrote, 2.2 years ago:

Hello,

You could use a slightly modified version of my Python script from this answer:

import glob
import sys
import pandas


def read_filenames(names):
    """Expand any glob patterns in the argument list into file names."""
    for arg in names:
        if "*" in arg:
            for file in glob.glob(arg):
                yield file
        else:
            yield arg


data = []

for sample in read_filenames(sys.argv[1:]):
    # Read each BED file, indexing on the (chr, start, end) triple and
    # using the file name as the name of the value column.
    frame = pandas.read_csv(
        sample,
        sep="\t",
        header=None,
        names=["chr", "start", "end", sample],
        index_col=[0, 1, 2]
    )
    data.append(frame)

# Outer-join all frames on the interval index, so intervals missing
# from a file end up as NA in that file's column.
data_joined = data[0].join(data[1:], how='outer')

print(data_joined.to_csv(
    sep="\t",
    na_rep="NA",
    float_format='%.0f'  # joined values become floats; print as integers
    ))

Run it like this:

$ python join.py File* > result.csv

fin swimmer

Powered by Biostar version 2.3.0