overlap gene and exon with bed file
1
0
Entering edit mode
8.8 years ago
bioguy24 ▴ 230

I have a file (gene_exons.bed) that I am trying to use to map a bed file with baits to. That is use the bait bed to find if it overlaps the gene_exon.bed and if it does, output the result.

bedmap --echo --echo-map-id-uniq epilepsy70_medex_edit.bed output.bed > answer.bed
bedmap --echo --echo-map epilepsy70_medex_edit.bed output.bed > gene_exon.bed

I have tried both commands above but can not produce the desired result. Is it possible? Thank you :)

gene_exon.bed

chr1    11868    12227    DDX11L1     1
chr1    12009    12057    DDX11L1     1
chr1    12178    12227    DDX11L1     2
chr1    12612    12697    DDX11L1     3
chr1    12612    12721    DDX11L1     2
chr1    12974    13052    DDX11L1     4
chr1    13220    13374    DDX11L1     5
chr1    13220    14409    DDX11L1     3
chr1    13452    13670    DDX11L1     6

epilepsy70_medex_edit.bed (baits)

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922

Desired output

chr1    40562776    40562920   gene   exon
chr1    43392701    43392922   gene   exon
bedops • 2.8k views
ADD COMMENT
0
Entering edit mode

If you need to validate BED files, you can use bedops --ec --everything:

$ bedops --ec --everything gene_exon.bed > /dev/null
$ bedops --ec --everything epilepsy70_medex_edit.bed > /dev/null

This will tell you if a BED file is missing information, has malformed fields, or is not sorted correctly (i.e., per BEDOPS sort-bed).

If you need results in a different format, you can use bedmap --echo-map-id, but instead pre-process your inputs to contain both gene and exon number in the ID field.

When mapping with bedmap, the ID field will contain both gene name and exon number and so the output will contain that mapped result.

Otherwise, it's not immediately clear to me exactly what you are trying to do with your inputs. In that case, if you want to post a snippet of both input files epilepsy70_medex_edit.bed and output.bed, as well as a snippet of the expected output file answer.bed, that might help me help write you an awk statement that does the necessary pre-processing.

I don't think either BEDOPS or bedtools will do what you want without some extra work, but please feel free to post more info and I'll try to help show you how to do things with BEDOPS.

ADD REPLY
0
Entering edit mode

The below command:

bedmap --echo --echo-map-id-uniq epilepsy70_medex_edit.bed gene_exon.bed > answer.bed

creates a answer.bed that is close (that is it includes the gene name but not the exon):

answer.bed

chr19    50365271    50365391 |
chr19    50365430    50365550 |NAPSA
chr19    50365614    50365734 |NAPSA

Basically, I am just trying to ge tthe exon number included. I do have a file (gene_exon.bed) contains that info.

epilepsy70_medex_edit.bed (baits)

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922

Desired answer.bed

chr1    40562776    40562920   gene   exon
chr1    43392701    43392922   gene   exon

Thank you very much :). I will also try to concatenate all the values in ID)

ADD REPLY
0
Entering edit mode

Can the below command be used to identify the gene and exon of a given overlap? I have a file with the gene:exon concatenated in column 4, but the results (as of now) do not seem to be using this file. Thank you :).

bedmap --echo --skip unmapped --echo-map-id-uniq epilepsy70_medex_edit.bed test.bed > answer.bed

test.bed

chr1    11868    12227    DDX11L1:1
chr1    12009    12057    DDX11L1:1
chr1    12178    12227    DDX11L1:2
chr1    12612    12697    DDX11L1:3
chr1    12612    12721    DDX11L1:2
chr1    12974    13052    DDX11L1:4
chr1    13220    13374    DDX11L1:5
chr1    13220    14409    DDX11L1:3
chr1    13452    13670    DDX11L1:6

epilepsy70_medex_edit.bed (baits)

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922

Desired output

chr1    40562776    40562920   gene:exon
chr1    43392701    43392922   gene:exon

Output (as of now)

chr1    154541927    154542093
|TDRD10:1;TDRD10:2;TDRD10:7
chr1    154542230    154542350
|
chr1    154542723    154542853
|TDRD10:3;TDRD10:8
chr1    154543654    154544647
|TDRD10:10;TDRD10:2;TDRD10:3;TDRD10:4;TDRD10:5;TDRD10:9
ADD REPLY
0
Entering edit mode

I think you are missing a hyphen in --skip-unmapped, but --echo-map-id-uniq is working correctly.

The ID field includes the exon number, in this case, so an ID like TDRD10:10 is treated (correctly) as a distinct string from TDRD10:2.

If you want a tab between range and IDs, use --delim '\t' with bedmap. This replaces the pipe character ('|') with a tab.

Otherwise, you could collapse the IDs by prefix by post-processing the current output with Perl or Python, etc.

For instance:

#!/usr/bin/env perl

use strict;
use warnings;

while (<STDIN>) {
    chomp;
    my ($rangeStr, $idsStr) = split("|", $_);
    my @ids = split(";", $idsStr);
    my $prefixHash;
    foreach my $id (@ids) {
        my ($prefix, $exon) = split(":", $id);
        if (! defined $prefixHash->{$prefix}) { $prefixHash->{$prefix} = (); }
        push $exon, @($prefixHash->{$prefix});
    }
    print STDOUT $rangeStr."\t";
    foreach my $prefix (sort keys %{$prefixHash}) {
        print STDOUT "$prefix:".join(",", sort @{$prefixHash->{$prefix}}).";";
    }
    print STDOUT "\n";
}

Untested, but this should perhaps spit out something more readable, like:

chr1    154542723    154542853    TDRD10:3,8;
chr1    154543654    154544647    TDRD10:2,3,4,5,9,10;

Note that you would want to use --skip-unmapped to prepare input for this script. Or else you'll likely get one or more lines with an empty fourth field where there are no mapped elements.

ADD REPLY
0
Entering edit mode

I named the script gene_exon.pl and placed the file bedmap created (answer.bed) in a directory named file.

I did a cd to that directory and then:

perl ~/get_exon.pl  > split.txt
Errors:

perl ~/gene_exon.pl  > QC.txt
Scalar found where operator expected at /home/cmccabe/gene_exon.pl line 12, near "@($prefixHash"
        (Missing operator before $prefixHash?)
syntax error at /home/cmccabe/gene_exon.pl line 12, near "@($prefixHash"
Global symbol "$rangeStr" requires explicit package name at /home/cmccabe/gene_exon.pl line 14.
Global symbol "$prefixHash" requires explicit package name at /home/cmccabe/gene_exon.pl line 15.
Global symbol "$prefixHash" requires explicit package name at /home/cmccabe/gene_exon.pl line 16.
syntax error at /home/cmccabe/gene_exon.pl line 19, near "}"
Execution of /home/cmccabe/gene_exon.pl aborted due to compilation errors.

Thank you :)

ADD REPLY
0
Entering edit mode

Try:

#!/usr/bin/env perl

use strict;
use warnings;

while (<STDIN>) {
    chomp;
    my ($rangeStr, $idsStr) = split("\\|", $_);
    my @ids = split(";", $idsStr);
    my $prefixHash;
    foreach my $id (@ids) {
        my ($prefix, $exon) = split(":", $id);
        push @{$prefixHash->{$prefix}}, $exon;
    }
    print STDOUT $rangeStr."\t";
    foreach my $prefix (sort keys %{$prefixHash}) {
        print STDOUT "$prefix:".join(",", sort @{$prefixHash->{$prefix}}).";";
    }
    print STDOUT "\n";
}

Like I said, this is mainly to show what could be done generally to collapse IDs by prefix. You may need to tweak this further depending on your input.

ADD REPLY
0
Entering edit mode
8.8 years ago

You're looking for intersectBed from bedtools

ADD COMMENT
0
Entering edit mode

Thank you very much for the information, I get the below error. Thank you :).

intersectBed -a gene_exon.bed -b epilepsy70_medex_edit.bed > file.bed
Unexpected file format.  Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?

gene_exon.bed

chr1    11868    12227    DDX11L1     1
chr1    12009    12057    DDX11L1     1
chr1    12178    12227    DDX11L1     2
chr1    12612    12697    DDX11L1     3
chr1    12612    12721    DDX11L1     2
chr1    12974    13052    DDX11L1     4
chr1    13220    13374    DDX11L1     5
chr1    13220    14409    DDX11L1     3
chr1    13452    13670    DDX11L1     6
chr1    14403    14501    WASH7P     11
chr1    15004    15038    WASH7P     10
chr1    15795    15947    WASH7P     9
chr1    16606    16765    WASH7P     8
chr1    16857    17055    WASH7P     7
chr1    17232    17368    WASH7P     6
chr1    17368    17436    MIR6859-2     1

epilepsy70_medex_edit.bed

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
ADD REPLY
0
Entering edit mode

Hm... It should work with those files. Are there non-integer lines? Is the first line blank? Is it tab delimited or spaces?

ADD REPLY

Login before adding your answer.

Traffic: 2320 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6