Question: Does multiIntersectBed works with -u and -f parameter?
1
gravatar for ivivek_ngs
4.2 years ago by
ivivek_ngs4.8k
Seattle,WA, USA
ivivek_ngs4.8k wrote:

Does the multiIntersectBed function of the bedtools work with the parameter of fraction (-f). I am trying to intersect 4 bed files together to see if they have overlaps together but minimum bases should be around 50% , am not interested in single base match. I am interested in finding if 50% or above of bases in each file match among all bed files giving me the pattern showing the listing of the column where I can see all the labels together. I am trying to do something like this but it does not seem to work. if anyone can give me some suggestions how to get this done? It seems that -u and -f does not work here. Is there a way I can do that? I tried BEDOPS as well, not that much of a help. I would appreciate some input.

multiIntersectBed -i file1.bed file2.bed file3.bed file4.bed -header -names A B C D -header -u -f .50| head

 

Regards

ADD COMMENTlink modified 4.2 years ago by Aaronquinlan10k • written 4.2 years ago by ivivek_ngs4.8k

Although it is probable that someone in here will know the answer, you are more likely to get answers for your bedtools questions in the googlegroups for bedtools. The developers are usually very quick to reply to queries, and often questions have been asked (and answered) before in there.
 

ADD REPLYlink written 4.2 years ago by A. Domingues2.0k
3
gravatar for Aaronquinlan
4.2 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

You can't do this directly with bedtools multiinter, but you can answer this by combing a few tools. It is a smidge complicated, but it should work and it demonstrates how combining multiple bedtools can usually get you the answer you need. 

    ➜ cat a.bed
    chr1    15    20    a1
    chr1    18    25    a2   
    ➜ cat b.bed
    chr1    16    20    b
    ➜ cat c.bed
    chr1    17    21    c

Step 1: Find intervals common to all three files. The fourth column of the output is the number of datasets present at the interval, so it can be used as a filtering criterion.

   ➜ bedtools multiinter -i a.bed b.bed c.bed | awk '$4 == 3' > common.bed
   ➜ cat common.bed
   chr1    17    21    3    1,2,3    1    1    1

Step 2. The issue now is that you don't know the degree of overlap these intervals have with the original intervaks in each file. So, we will now just use the intersect tool to intersect the common intervals with the original 3 files while requiring 50% overlap.

    ➜ bedtools intersect -a common.bed \
                          -b a.bed b.bed c.bed \
                          -f 0.50 -r \
                          -wa -wb \
        > common.with.originals.bed
    ➜ cat common.with.originals.bed
    chr1  17  21  3   1,2,3  1  1  1  1  chr1  15  20  a1
    chr1  17  21  3   1,2,3  1  1  1  1  chr1  18  25  a2
    chr1  17  21  3   1,2,3  1  1  1  2  chr1  16  25  b
    chr1  17  21  3   1,2,3  1  1  1  3  chr1  17  21  c

Step 3. Now we have to filter the output from step 2 to ensure that the "common" intervals indeed overlapped with all 3 datasets. Conveniently, when using the -wb option with multiple "B" files, the intersect tool reports the dataset from which the hit came. In this case, that will be the 9th column. So, we just need to test that each interval from step 2 had hits from all three input files. We can do this with the groupby tool. 

    ➜ bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed
    chr1  17  21  3   1,2,3   1   1   1   3

Since the last column now reflect the count of distinct datasets that overlapped the common interval at 50%, you can use awk to filter for only those intervals wheere the count is 3.

    ➜ bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed \
      | awk '$9==3'
    chr1  17  21  3   1,2,3   1   1   1   3
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Aaronquinlan10k

Aaronquinlan thank you for this, I seem to be having problems with one of the bedtools command I have version bedtools v2.26.0 I ran the same commands and the line below is giving me problems: < bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed > the output was just : 3

Would greatly appreciate your help, thanks, Linda

ADD REPLYlink written 2.7 years ago by molla.linda60
0
gravatar for Alex Reynolds
4.2 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

I tried BEDOPS as well, not that much of a help. I would appreciate some input.

There is documentation to explain how to find reciprocal overlap between multiple inputs. You could do:

$ bedops -u file1.bed file2.bed file3.bed file4.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-both 0.5 - \
    | awk -F"|" '(split($2, a, ";") > 1)' \
    > answer.bed

If this is not what you are trying to do, please provide more detail.

ADD COMMENTlink written 4.2 years ago by Alex Reynolds28k

Alex Reynolds , I tried the above command. The thing is I have found CNV regions of my data and among this regions I have found promoter regions , methylation sites and CpG sites. Now I have extracted the genomic coordinates of all these sites in form of bed files. The promoter , methylation and CpG sites are hits from UCSC. Now I want to take all the 4 files and do an overlap to see if at any region there is a common overlap among all the four files but it should not be at 1 base region since my file1.bed is a CNV bed file of my sample. So I want to see at any point in the file1.bed are there few bases that will correspond to similar bases in file.bed file3.bed and file4.bed. But it should span a few bases not just a single bases overlap. I hope things make sense now. So I want overlap should be among all the 4 bed files together spanning atleast more than 1 bases and by that it would even work if 30% of my bases in CNV which is file1.bed gets overlap at the sametime with all the other 3 files for same bases. I  hope it is a bit clear now.

ADD REPLYlink written 4.2 years ago by ivivek_ngs4.8k
1

The --fraction-both option designates, at minimum, how much the two elements (from any set) should overlap. If this option is left out, then the minimum overlap criteria is one base. But by including the --fraction-both option, you override the one base overlap criterion with the custom fractional preference.

If you are instead measuring cardinality — or how many elements overlap between sets — and not the extent of interval overlap, then you could do something like bedmap --indicator and awk between pairs of sets, to get the fraction of cardinal-overlap:

$ bedmap --indicator A B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AB.txt 
$ bedmap --indicator B A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BA.txt
$ bedmap --indicator A C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AC.txt
$ bedmap --indicator C A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CA.txt
$ bedmap --indicator A D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AD.txt
$ bedmap --indicator D A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DA.txt
$ bedmap --indicator B C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BC.txt
$ bedmap --indicator C B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CB.txt
$ bedmap --indicator B D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BD.txt
$ bedmap --indicator D B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DB.txt
$ bedmap --indicator C D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CD.txt
$ bedmap --indicator D C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DC.txt

Basically, for N sets, you are doing N(N-1) comparisons (a set will have 100% concordance with its own elements).

Once you have these results, you can put them into an N x N matrix and then perhaps plot a heatmap. You can plot numbers in cells to quickly see fraction of cardinal-overlap and determine if you get over 50% concordance between all sets. Or you can just programmatically read out each of the rows or columns and see if the fractional threshold is met.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Alex Reynolds28k

am confused with "'(split($2, a, ";") > 1)'" part. I will for sure give it a try but I used bedops for the first time today. Just to make sure with the command of bedops and bedmap can I achieve what am looking? like an overlap of genomic interval from all 4 files together?

ADD REPLYlink written 4.2 years ago by ivivek_ngs4.8k

It might help to put in tee statements so that you see what gets piped from one step to the next.

ADD REPLYlink written 4.2 years ago by Alex Reynolds28k
1

A simpler option may also be to take the multiset union of the inputs to make a "master element list", and do the bedmap --indicator and awk step against this master list for each of the four sets:

$ bedops --everything A B C D > Master
$ bedmap --indicator A Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_A_Master.txt
$ bedmap --indicator B Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_B_Master.txt
$ bedmap --indicator C Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_C_Master.txt
$ bedmap --indicator D Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_D_Master.txt

You could then check if the fraction_* files are all greater than 50%.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1230 users visited in the last hour