How to Bedtools intersect two different sorted BED files
1
0
Entering edit mode
4.2 years ago
caro-ca ▴ 20

Hey! I'm trying to determine the absence of transposon elements on a non-reference genome by looking at split reads. For that, I mapped the reference genome against a pair of reads using STAR. Consequently, I sorted and indexed the resulting bam file and converted split junction reads with Bam2hints. The output of Bam2hints was a gff file that was converted to a bed file by using gff2bed

gff2bed < myfile.gff > my_sorted_file.bed

Finally, to count if there are hints/junctions spanning a TEs I tried to use Bedtools intersect:

bedtools intersect -a TE_file.bed -b my_sorted_file.bed

(TE_file.bed is the bed file containing the transposon element coordinates)

These 2 Bed files look like this:
- TE_file.bed

 chrI    22231   22552   chrI_s1+_TY1    .       +       reannotate      transposable_element    .       ID=chrI_s1+_TY1
    chrI    138753  138990  chrI_s2-_TY1    .       -       reannotate      transposable_element    .       ID=chrI_s2-_TY1
    chrI    160105  160237  chrI_st1-_TY1   .       -       reannotate      transposable_element    .       ID=chrI_st1-_TY1
    chrI    160238  166163  chrI_u1-_TY1    .       -       reannotate      transposable_element    .       ID=chrI_u1-_TY1
    chrI    182613  182953  chrI_s4+_TY3    .       +       reannotate      transposable_element    .       ID=chrI_s4+_TY3

-my_sorted_file.bed

 2micron 1674    4928    .       0       .       b2h     intron  .       pri=4;src=E
chrI    5028    15053   .       0       .       b2h     intron  .       pri=4;src=E
chrI    6306    7155    .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    6826    16996   .       0       .       b2h     intron  .       pri=4;src=E
chrI    6827    9072    .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    10926   13026   .       0       .       b2h     intron  .       pri=4;src=E
chrI    12053   24485   .       0       .       b2h     intron  .       mult=2;pri=4;src=E
chrI    12061   21076   .       0       .       b2h     intron  .       pri=4;src=E

A piece of the output looks like this:

***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
chrI    5028    15053   .   0   .   b2h intron  .   pri=4;src=E

chrI    22231   22552   chrI_s1+_TY1    .   +   reannotate  transposable_element    ID=chrI_s1+_TY1
chrI    138753  138990  chrI_s2-_TY1    .   -   reannotate  transposable_element    ID=chrI_s2-_TY1
chrI    160105  160237  chrI_st1-_TY1   .   -   reannotate  transposable_element    ID=chrI_st1-_TY1
chrI    160238  166163  chrI_u1-_TY1    .   -   reannotate  transposable_element    ID=chrI_u1-_TY1

How can I solve this problem? Which columns should I remove, reorder? Thank you in advance.

bedtools • 1.3k views
ADD COMMENT
2
Entering edit mode
4.2 years ago
ATpoint 82k

I guess the issue is that some chromosomes start with a character and some start with a number. The simplest would be temporarily add any dummy prefix to the chromosome column so that all chromosomes start with a character. Then after the intersect simply remove it again:

bedtools intersect \
  -a <(awk 'OFS="\t" {$1="DUMMYNAME"$1} {print'} file1.bed) \
  -b <(awk 'OFS="\t" {$1="DUMMYNAME"$1} {print'} file2.bed) \
  | awk 'FS=OFS="\t" {gsub("DUMMYNAME","");print}' > output.bed
ADD COMMENT
0
Entering edit mode

Thank you for your reply. Since there was one chromosome starting with a number I just added a character to it. Your code did not work for me so I did the adding manually. This is the stdout:

bedtools intersect -a Gallone_hints.bed -b reference_TE_locations.bed 
    ***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
    c2micron    1674    4928    .   0   .   b2h intron  .   pri=4;src=E

    chrI    22231   22552   .   0   .   b2h intron  .   mult=2;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   pri=4;src=E
    chrI    22231   22552   .   0   .   b2h intron  .   mult=6;pri=4;src=E
    chrXVI  856550  856747  .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrXVI  850624  856550  .   0   .   b2h intron  .   mult=3;pri=4;src=E
    chrXVI  856550  856881  .   0   .   b2h intron  .   mult=22;pri=4;src=E
    chrXVI  850624  856550  .   0   .   b2h intron  .   mult=22;pri=4;src=E
    chrXVI  933066  933400  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933066  933400  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  933538  933675  .   0   .   b2h intron  .   pri=4;src=E
    chrXVI  937875  937992  .   0   .   b2h intron  .   pri=4;src=E
    ***** WARNING: File Gallone_hints.bed has inconsistent naming convention for record:
    c2micron    1674    4928    .   0   .   b2h intron  .   pri=4;src=E
ADD REPLY
1
Entering edit mode

The making it chr2micron, so have the same prefix everywhere, so chr.

ADD REPLY
0
Entering edit mode

Thank you for your help and time! It worked.

ADD REPLY

Login before adding your answer.

Traffic: 2308 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6