Efficiently find regions in bed file that are within start and end regions of another bed file.
1
0
Entering edit mode
4 months ago
rxs1018 • 0

If I have two bed files:

A.bed

chr1 100 200 SVID1  
chr2 1300 2100 SVID2  
chr5 3000 4500 SVID3 

B.bed

chr1 40 210  
chr2 1330 2901  
chr5 3059 4430  

I would like to find the features in B that have both the start and end position within 100bp of the start and end positions of A.bed, and label with the SVID.

Therefore in this case, the output would be:

C.bed

chr1 40 210 SVID1  
chr5 3059 4430 SVID3  

Is there an easy and/or efficient way to do this? Preferably in bedtools, but I do not mind the tool used

intersect bedtools • 822 views
ADD COMMENT
0
Entering edit mode

bedtools intersect : https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Curious as to how/why you did not find this tool yourself. You even included intersect as a tag for post.

ADD REPLY
0
Entering edit mode

I'm familiar with using bedtools for finding intersections, but my understanding is that it is only capable of computing intersection/fraction of intersection rather than check whether start and end are within n base pairs?

ADD REPLY
0
Entering edit mode

Thank you for clarifying that.

While there is bedtools window : https://bedtools.readthedocs.io/en/latest/content/tools/window.html it looks like you may be best served by using a custom program. Ask your favorite LLM for assistance if you don't program or I can post a solution suggested by GPT.


import pandas as pd
import sys

def main(file1_path, file2_path):
    # Load first BED file (with name column)
    df1 = pd.read_csv(file1_path, sep='\t', header=None, names=["chrom", "start", "end", "name"])

    # Load second BED file (no name column expected)
    df2 = pd.read_csv(file2_path, sep='\t', header=None, names=["chrom", "start", "end"])

    matches = []

    for _, row2 in df2.iterrows():
        chrom_matches = df1[df1["chrom"] == row2["chrom"]]
        for _, row1 in chrom_matches.iterrows():
            if abs(row2["start"] - row1["start"]) <= 100 and abs(row2["end"] - row1["end"]) <= 100:
                matches.append({
                    "chrom": row2["chrom"],
                    "start": row2["start"],
                    "end": row2["end"],
                    "matched_name": row1["name"]
                })
                break  # Stop at first match

    # Convert to DataFrame and print
    matched_df = pd.DataFrame(matches)
    print(matched_df.to_csv(sep='\t', index=False, header=False))

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python bed_matcher.py <first_bed_with_name> <second_bed>")
        sys.exit(1)

    file1 = sys.argv[1]
    file2 = sys.argv[2]

    main(file1, file2)

Here is how you would run the program. (files as shown in the example in original question) :

python bed_matcher.py file1.bed file2.bed
ADD REPLY
0
Entering edit mode

One way is to use BEDOPS bedmap:

bedmap --echo --echo-map-id --range 100 --delim '\t' B.bed A.bed > answer.bed

Via: https://github.com/bedops/bedops

Documentation for bedmap is available at: https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html

ADD REPLY
2
Entering edit mode
4 months ago

a one liner (extends the original file2bed by X=100bp):

 awk '{X=100;B=int($2);E=int($3);printf("%s\t%d\t%s\t%d\t%d\n",$1,(B>=X?B-X:0),E+X,B,E);}' file2.bed |\
 bedtools intersect -a - -b file1.bed -wa -wb |\
 awk -F '\t' '{N=int($4)-int($7); if(N<0) N=-N; if(N> 100) next; N=int($5)-int($8); if(N<0) N=-N; if(N> 100) next;print;}' |\
 cut -f 1,4,5,9

chr1    40  210 SVID1
chr5    3059    4430    SVID3
ADD COMMENT

Login before adding your answer.

Traffic: 3174 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6