Question

Bed Files Comparison

1

Entering edit mode

14.0 years ago

Amyemilie ▴ 10

Hi,

I have two bed files : the first (1) one determinate features on my genome and the second one (2) is the output of the sequencing after cleavage on the same genome. I want to know if the sequences of (2) are contained in the features of (1). If they are not, I want to print the sequence in an output file (3). If they are I want to print the half of the sequences in the same output file (3).

I did a perl script but it takes 25hours to run because I have 16millions sequences in (2). So I was wondering if someone could give me some clue to build a bash script, shorter and faster than a perl script.

bed bash comparison • 4.7k views

ADD COMMENT • link updated 14.0 years ago by brentp 24k • written 14.0 years ago by Amyemilie ▴ 10

Ram · Answer 1 · 2011-07-27

This is essentially a question about fast interval overlap detection, a topic which has been addressed here several times. See the answers to:

This sounds like a candidate for BEDTools. First, use intersectBed to find which features in BED file2 are contained in a feature from BED file 1. Then run it again using the -v option to get the inverse set of features in BED file 2. You can extract the sequences of the resulting features from a reference using fastaFromBed. I'm not sure what you mean by "print the half of the sequences in the same output file", but the BEDTools programs will deal with the hard part efficiently (minutes rather than 25 hours!) and you can use some Perl to tidy up.

Ram · Answer 2 · 2011-07-27

If I understand correctly (may need more clarification of your 2nd file--are they mapped?), and your sequences are already mapped, you can do something like the following with bedtools 1:

$ intersectBed -a seqs.bed2 -b features.bed1 -wa -v -f 1 > no_overlap
$ intersectBed -a seqs.bed2 -b features.bed1 -wa -f 1 > full_overlap

where your seqs.bed2 will be your (2) and in a tab-delimited format like

#chrom start end sequence
chr1 1234 4567 ACTCTGAC....

and features.bed1 is your (1) in similar BED format.

You may have to adjust the exact calls to intersectBed, but see the help after installation, it may do what you need. Then if you just need the sequence, you can extract that after.

If your sequences (2) are not mapped, use a short-read aligner like BWA or bowtie to map them to the sequences in (1)