Question

Bedtools intersect alternatives available

0

6.7 years ago

Jeffin Rockey ★ 1.3k

What alternative tools can do what bedtools intersect does?

bedtools • 5.4k views

ADD COMMENT • link updated 6.7 years ago by Alex Reynolds 36k • written 6.7 years ago by Jeffin Rockey ★ 1.3k

1

And why wouldn't you use "bedtools intersect"?

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

1

I have 10+ methylome BED files containing the positions of methylated bases. They were produced with bwa-meth, followed by MethylDackel, followed by conversion to BED. Each file has roughly 80,000,000+ entries. I was trying to intersect these with gene features such as exons, and the counts came out as expected. The problem is that I have around 50,000 genes, and bedtools intersect takes approximately 5 minutes in total for the exons, introns, and promoter of a single gene. Extrapolated to 50,000 genes, the intersection would take weeks to complete. (I also tried bedops, but the counts differed from those of bedtools.)

I have been an avid bedtools user for a long time, but in this case, even with sorted BED files, I could not achieve the necessary speed.

That is why I asked for other alternatives.
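The extrapolation can be checked with a little arithmetic. This is only a back-of-envelope sketch: it takes the two figures quoted above at face value and assumes one file processed with strictly sequential per-gene runs.

```python
# Back-of-envelope estimate using the approximate figures quoted in the post.
MINUTES_PER_GENE = 5     # exon + intron + promoter intersects for one gene
N_GENES = 50_000         # approximate number of genes

total_minutes = MINUTES_PER_GENE * N_GENES
total_days = total_minutes / (60 * 24)
total_weeks = total_days / 7

print(f"{total_minutes:,} minutes = about {total_days:.0f} days "
      f"({total_weeks:.0f} weeks) of sequential runs")
```

If these figures hold, the per-gene invocation pattern, rather than any single intersect, is the likely bottleneck, since the large file is read again for every gene.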

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

1

That's a better description of the problem you are trying to solve.

Some ideas:

Use tabix to index your BED file. This gives you random access to given regions.
Think about splitting the regions you want to intersect with, and use GNU parallel.
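The splitting idea can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a drop-in replacement for GNU parallel: the region lines are made up, and `process_chunk` is a hypothetical stand-in that only counts records where a real job would launch the intersect for one chromosome.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_by_chrom(bed_lines):
    """Group BED lines by their first column (chromosome name)."""
    chunks = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        chrom = line.split("\t", 1)[0]
        chunks[chrom].append(line)
    return dict(chunks)

def process_chunk(item):
    """Hypothetical per-chromosome job; here it only counts records.

    In real use this is where the intersect for a single chromosome
    would be started, for example through subprocess.
    """
    chrom, lines = item
    return chrom, len(lines)

# Made-up regions, tab-separated as in a BED file.
regions = [
    "chr1\t100\t200\tgeneA_exon1",
    "chr1\t300\t400\tgeneA_exon2",
    "chr2\t50\t150\tgeneB_exon1",
]

chunks = split_by_chrom(regions)
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(process_chunk, chunks.items()))

print(results)
```

Threads are enough here because the real work would happen in external processes; for pure-Python work a process pool would be the more natural choice.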

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

I see. How about parallelizing things per exon?

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

0

When I tried to parallelize bedtools, the individual processes slowed down, effectively nullifying the expected advantage. I checked on different servers [256 GB], but this behaviour recurs; it may have something to do with RAM.

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

0

Please confirm whether you are using bedtools intersect -sorted.

ADD REPLY • link 6.7 years ago by John Marshall 3.1k

0

John,

I believe so. Let me cross-check and confirm as soon as I can.

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

0

John,

Confirmed. It is slow even with -sorted.

Jeffin

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

0

It depends on what you want to "intersect".

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

But what is the purpose? For some tasks, the "grep -f" or "join" commands can also be used.

ADD REPLY • link 6.7 years ago by Tm ★ 1.1k

0

A good read

What Is The Proper Way To Think About Reinventing The Wheel As A Bioinformatician?

ADD REPLY • link 6.7 years ago by lakhujanivijay 5.9k

0

If this is whole-genome sequencing, I recommend first running a basic intersect (either with bedtools or bedops) against just the gene regions, keeping only the bases that overlap genes for the more detailed annotation tasks looking at exons, introns, etc. Odds are you want to separate the bases into gene-overlapping and non-gene-overlapping anyway. You may also want to consider splitting the intersect by annotation type, i.e., run a separate process with a BED file containing only exons or only introns, respectively.
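The two-stage idea can be sketched as follows. This is a toy illustration for a single chromosome with made-up coordinates, assuming 0-based, half-open intervals and single-base sites; the helper names are hypothetical, and real data would go through bedtools or bedops rather than this code.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def keep_sites_in(sites, intervals):
    """Return the sites (0-based positions) that fall inside any interval."""
    merged = merge_intervals(intervals)
    starts = [s for s, _ in merged]
    kept = []
    for pos in sites:
        i = bisect_right(starts, pos) - 1   # last interval starting at or before pos
        if i >= 0 and pos < merged[i][1]:
            kept.append(pos)
    return kept

# Toy data for one chromosome (all coordinates are made up).
genes = [(100, 500), (450, 900), (2000, 2600)]
exons = [(100, 200), (400, 500), (2000, 2100)]
sites = [50, 150, 320, 470, 950, 2050, 2599, 2600]

genic = keep_sites_in(sites, genes)     # stage 1: coarse filter against genes
exonic = keep_sites_in(genic, exons)    # stage 2: detailed pass on fewer sites

print(len(sites), len(genic), len(exonic))
```

The second stage only ever sees the sites that survived the first, which is where the saving comes from when most of the genome is intergenic.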

ADD REPLY • link 6.7 years ago by Friederike 9.0k

score 0 · Answer 1 · 2018-10-16

0

6.7 years ago

Alex Reynolds 36k

You would need sorted inputs (sorted with sort-bed; I'm not sure what sortBed does), but if you are looking for faster options, bedops --intersect and bedops --element-of do different kinds of intersections.

If you're counting overlaps of elements by class: bedmap --count and bedmap --faster --count can be useful.

You can also use the --chrom operator with BEDOPS tools to trivially parallelize work by chromosome via GNU Parallel or HPC job schedulers.
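To make the counting idea concrete, here is a small Python sketch of per-feature counting on sorted positions. It is not how bedmap is implemented; it assumes single-base sites, 0-based half-open features, and made-up data, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def index_sites(site_records):
    """Build {chrom: sorted list of 0-based positions} from (chrom, pos) pairs."""
    by_chrom = defaultdict(list)
    for chrom, pos in site_records:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()
    return by_chrom

def count_sites(features, sites_by_chrom):
    """Count single-base sites inside each half-open feature [start, end)."""
    counts = {}
    for chrom, start, end, name in features:
        positions = sites_by_chrom.get(chrom, [])
        # Two binary searches give the number of positions p with start <= p < end.
        counts[name] = bisect_left(positions, end) - bisect_left(positions, start)
    return counts

# Made-up methylation sites and features.
sites = [("chr1", 10), ("chr1", 15), ("chr1", 20), ("chr2", 5)]
features = [
    ("chr1", 10, 20, "exon1"),
    ("chr1", 0, 100, "gene1"),
    ("chr2", 0, 5, "promoter2"),
]

counts = count_sites(features, index_sites(sites))
print(counts)
```

Each feature only touches the positions of its own chromosome, which is why the work splits cleanly by chromosome.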

ADD COMMENT • link 6.7 years ago by Alex Reynolds 36k

0

Thanks, Alex. I will try the options you mentioned and see whether they can be used instead.

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

0

If you're not working with nested elements, --faster can speed things up even more than the usual bedops speedup. See the docs for a more detailed explanation of what nested elements are and whether --faster can be used here.

ADD REPLY • link 6.7 years ago by Alex Reynolds 36k

0

Alex, even without --faster, bedops was significantly faster, and that speedup alone was sufficient. But the difference between the results of --element-of 1 and those of bedtools intersect is what held me back from using bedops.

ADD REPLY • link 6.7 years ago by Jeffin Rockey ★ 1.3k

0

I don't know which overlap criterion bedtools uses by default, but --element-of 1 means one or more bases of overlap. More stringent overlap can be specified with more bases or with a percentage, e.g. --element-of 100% for full enclosure. Also check that the inputs are sorted and that they are provided in the correct order, i.e. bedops -e 1 A B will give a different answer from bedops -e 1 B A.
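A small Python sketch may help show why input order and reporting style change the counts. It is a simplified model under assumptions, not the BEDOPS or bedtools implementation: overlap is measured against each B element individually, percentage thresholds are not modelled, and `pairwise_overlaps` reflects my understanding that a default bedtools intersect run writes one record per overlapping A/B pair. All data are made up.

```python
def overlap_bases(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def element_of(a_elems, b_elems, min_bases=1):
    """Elements of A that share at least min_bases with some element of B.

    Each qualifying A element is reported once, unchanged.
    """
    return [a for a in a_elems
            if any(overlap_bases(a, b) >= min_bases for b in b_elems)]

def pairwise_overlaps(a_elems, b_elems):
    """One record per overlapping (A, B) pair: the shared sub-interval."""
    out = []
    for a in a_elems:
        for b in b_elems:
            if overlap_bases(a, b) > 0:
                out.append((max(a[0], b[0]), min(a[1], b[1])))
    return out

A = [(0, 100), (200, 300)]
B = [(10, 20), (30, 40), (250, 400)]

a_in_b = element_of(A, B)         # elements of A, so at most len(A) records
b_in_a = element_of(B, A)         # elements of B, so at most len(B) records
strict = element_of(A, B, 60)     # stricter threshold in bases
pairs = pairwise_overlaps(A, B)   # one record per overlapping pair

print(len(a_in_b), len(b_in_a), len(strict), len(pairs))
```

With this toy input the same two files yield different record counts depending on which file comes first and on whether each element or each overlapping pair is reported, so comparing line counts between tools is only meaningful once both are set to report the same thing.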

ADD REPLY • link 6.7 years ago by Alex Reynolds 36k

0

Is there a difference in how bedtools and bedops interpret the intervals (i.e., 0-based half-open vs. 1-based, etc.)?

ADD REPLY • link 6.7 years ago by Friederike 9.0k

0

BEDOPS works correctly with 0-based, half-open indexing. I'm not sure what other tools do.
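For readers unsure what the convention implies, here is a minimal Python sketch of 0-based, half-open overlap and of converting a 1-based, closed interval into it. The function names and coordinates are hypothetical and only illustrate how an unconverted position can shift a count by one base.

```python
def overlaps_half_open(a_start, a_end, b_start, b_end):
    """True if [a_start, a_end) and [b_start, b_end) share at least one base."""
    return a_start < b_end and b_start < a_end

def one_based_closed_to_bed(start, end):
    """Convert a 1-based, fully closed interval to 0-based, half-open."""
    return start - 1, end

# An exon covering 1-based bases 101..200 becomes (100, 200) in BED terms.
exon = one_based_closed_to_bed(101, 200)

# A site at 1-based position 100 lies just before the exon.
site_converted = one_based_closed_to_bed(100, 100)   # (99, 100)
site_unconverted = (100, 101)                        # 1-based value used as a 0-based start

print(overlaps_half_open(*exon, *site_converted))    # correctly outside the exon
print(overlaps_half_open(*exon, *site_unconverted))  # wrongly counted as inside
```

If one input was converted to BED with such an off-by-one shift, sites at feature boundaries would be gained or lost, which is worth ruling out when two tools disagree.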

ADD REPLY • link 6.7 years ago by Alex Reynolds 36k