Question: overlap gene and exon with bed file
0
gravatar for bioguy24
4.0 years ago by
bioguy24190
Chicago
bioguy24190 wrote:

I have a file (gene_exons.bed) that I am trying to use to map a bed file with baits to.  That is use the bait bed to find if it overlaps the gene_exon.bed and if it does, output the result.

bedmap --echo --echo-map-id-uniq epilepsy70_medex_edit.bed output.bed > answer.bed

bedmap --echo --echo-map epilepsy70_medex_edit.bed output.bed > gene_exon.bed

I have tried both commands above but can not produce the desired result.  Is it possible?  Thank you :)

gene_exon.bed

chr1    11868    12227    DDX11L1     1
chr1    12009    12057    DDX11L1     1
chr1    12178    12227    DDX11L1     2
chr1    12612    12697    DDX11L1     3
chr1    12612    12721    DDX11L1     2
chr1    12974    13052    DDX11L1     4
chr1    13220    13374    DDX11L1     5
chr1    13220    14409    DDX11L1     3
chr1    13452    13670    DDX11L1     6

 

epilepsy70_medex_edit.bed (baits)

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922

Desired output

chr1    40562776    40562920   gene   exon
chr1    43392701    43392922   gene   exon
bedops • 1.4k views
ADD COMMENTlink modified 4.0 years ago by karl.stamm3.5k • written 4.0 years ago by bioguy24190

If you need to validate BED files, you can use bedops --ec --everything:

$ bedops --ec --everything gene_exon.bed > /dev/null
$ bedops --ec --everything epilepsy70_medex_edit.bed > /dev/null

This will tell you if a BED file is missing information, has malformed fields, or is not sorted correctly (i.e., per BEDOPS sort-bed).

If you need results in a different format, you can use bedmap --echo-map-id, but instead pre-process your inputs to contain both gene and exon number in the ID field.

When mapping with bedmap, the ID field will contain both gene name and exon number and so the output will contain that mapped result.

Otherwise, it's not immediately clear to me exactly what you are trying to do with your inputs. In that case, if you want to post a snippet of both input files epilepsy70_medex_edit.bed and output.bed, as well as a snippet of the expected output file answer.bed, that might help me help write you an awk statement that does the necessary pre-processing.

I don't think either BEDOPS or bedtools will do what you want without some extra work, but please feel free to post more info and I'll try to help show you how to do things with BEDOPS.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds28k
The below command:
bedmap --echo --echo-map-id-uniq epilepsy70_medex_edit.bed gene_exon.bed > answer.bed

creates a answer.bed that is close (that is it includes the gene name but not the exon):

answer.bed

chr19    50365271    50365391 |
chr19    50365430    50365550 |NAPSA
chr19    50365614    50365734 |NAPSA

Basically, I am just trying to ge tthe exon number included.  I do have a file (gene_exon.bed) contains that info.

epilepsy70_medex_edit.bed (baits)

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922
Desired answer.bed

chr1    40562776    40562920   gene   exon
chr1    43392701    43392922   gene   exon

 

Thank you very much :).  I will also try to concatenate all the values in ID)

ADD REPLYlink written 4.0 years ago by bioguy24190

Can the below command be used to identify the gene and exon of a given overlap?  I have a file with the gene:exon concatenated in column 4, but the results (as of now) do not seem to be using this file.  Thank you :).

bedmap --echo --skip unmapped --echo-map-id-uniq epilepsy70_medex_edit.bed test.bed > answer.bed

test.bed
chr1    11868    12227    DDX11L1:1
chr1    12009    12057    DDX11L1:1
chr1    12178    12227    DDX11L1:2
chr1    12612    12697    DDX11L1:3
chr1    12612    12721    DDX11L1:2
chr1    12974    13052    DDX11L1:4
chr1    13220    13374    DDX11L1:5
chr1    13220    14409    DDX11L1:3
chr1    13452    13670    DDX11L1:6

epilepsy70_medex_edit.bed (baits)
chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194
chr1    40556976    40557096
chr1    40557706    40557854
chr1    40558059    40558189
chr1    40562776    40562920
chr1    43392701    43392922

Desired output
chr1    40562776    40562920   gene:exon
chr1    43392701    43392922   gene:exon

Output (as of now)
chr1    154541927    154542093
|TDRD10:1;TDRD10:2;TDRD10:7
chr1    154542230    154542350
|
chr1    154542723    154542853
|TDRD10:3;TDRD10:8
chr1    154543654    154544647
|TDRD10:10;TDRD10:2;TDRD10:3;TDRD10:4;TDRD10:5;TDRD10:9
ADD REPLYlink written 4.0 years ago by bioguy24190

I think you are missing a hyphen in --skip-unmapped, but --echo-map-id-uniq is working correctly. 

The ID field includes the exon number, in this case, so an ID like TDRD10:10 is treated (correctly) as a distinct string from TDRD10:2.

If you want a tab between range and IDs, use --delim '\t' with bedmap. This replaces the pipe character ('|') with a tab.

Otherwise, you could collapse the IDs by prefix by post-processing the current output with Perl or Python, etc.

For instance:

#!/usr/bin/env perl

use strict;
use warnings;

while (<STDIN>) {
    chomp;
    my ($rangeStr, $idsStr) = split("|", $_);
    my @ids = split(";", $idsStr);
    my $prefixHash;
    foreach my $id (@ids) {
        my ($prefix, $exon) = split(":", $id);
        if (! defined $prefixHash->{$prefix}) { $prefixHash->{$prefix} = (); }
        push $exon, @($prefixHash->{$prefix});
    }
    print STDOUT $rangeStr."\t";
    foreach my $prefix (sort keys %{$prefixHash}) {
        print STDOUT "$prefix:".join(",", sort @{$prefixHash->{$prefix}}).";";
    }
    print STDOUT "\n";
}

Untested, but this should perhaps spit out something more readable, like:

chr1    154542723    154542853    TDRD10:3,8;
chr1    154543654    154544647    TDRD10:2,3,4,5,9,10;

Note that you would want to use --skip-unmapped to prepare input for this script. Or else you'll likely get one or more lines with an empty fourth field where there are no mapped elements.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds28k

I named the script gene_exon.pl and placed the file bedmap created (answer.bed) in a directory named file.

I did a cd to that directory and then:

perl ~/get_exon.pl  > split.txt

 

Errors:

perl ~/gene_exon.pl  > QC.txt
Scalar found where operator expected at /home/cmccabe/gene_exon.pl line 12, near "@($prefixHash"
        (Missing operator before $prefixHash?)
syntax error at /home/cmccabe/gene_exon.pl line 12, near "@($prefixHash"
Global symbol "$rangeStr" requires explicit package name at /home/cmccabe/gene_exon.pl line 14.
Global symbol "$prefixHash" requires explicit package name at /home/cmccabe/gene_exon.pl line 15.
Global symbol "$prefixHash" requires explicit package name at /home/cmccabe/gene_exon.pl line 16.
syntax error at /home/cmccabe/gene_exon.pl line 19, near "}"
Execution of /home/cmccabe/gene_exon.pl aborted due to compilation errors.

Thank you :).

 

 
ADD REPLYlink written 4.0 years ago by bioguy24190

Try:

#!/usr/bin/env perl

use strict;
use warnings;

while (<STDIN>) {
    chomp;
    my ($rangeStr, $idsStr) = split("\\|", $_);
    my @ids = split(";", $idsStr);
    my $prefixHash;
    foreach my $id (@ids) {
        my ($prefix, $exon) = split(":", $id);
        push @{$prefixHash->{$prefix}}, $exon;
    }
    print STDOUT $rangeStr."\t";
    foreach my $prefix (sort keys %{$prefixHash}) {
        print STDOUT "$prefix:".join(",", sort @{$prefixHash->{$prefix}}).";";
    }
    print STDOUT "\n";
}

Like I said, this is mainly to show what could be done generally to collapse IDs by prefix. You may need to tweak this further depending on your input.

ADD REPLYlink written 4.0 years ago by Alex Reynolds28k
0
gravatar for karl.stamm
4.0 years ago by
karl.stamm3.5k
United States
karl.stamm3.5k wrote:

You're looking for intersectBed from bedtools

ADD COMMENTlink written 4.0 years ago by karl.stamm3.5k

 

Thank you very much for the information, I get the below error.  Thank you :).

intersectBed -a gene_exon.bed -b epilepsy70_medex_edit.bed > file.bed
Unexpected file format.  Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?
gene_exon.bed

chr1    11868    12227    DDX11L1     1
chr1    12009    12057    DDX11L1     1
chr1    12178    12227    DDX11L1     2
chr1    12612    12697    DDX11L1     3
chr1    12612    12721    DDX11L1     2
chr1    12974    13052    DDX11L1     4
chr1    13220    13374    DDX11L1     5
chr1    13220    14409    DDX11L1     3
chr1    13452    13670    DDX11L1     6
chr1    14403    14501    WASH7P     11
chr1    15004    15038    WASH7P     10
chr1    15795    15947    WASH7P     9
chr1    16606    16765    WASH7P     8
chr1    16857    17055    WASH7P     7
chr1    17232    17368    WASH7P     6
chr1    17368    17436    MIR6859-2     1

epilepsy70_medex_edit.bed

chr1    40539722    40539865
chr1    40542489    40542609
chr1    40544221    40544341
chr1    40546054    40546174
chr1    40555071    40555194

 

ADD REPLYlink written 4.0 years ago by bioguy24190

Hm... It should work with those files. Are there non-integer lines? Is the first line blank?  Is it tab delimited or spaces?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by karl.stamm3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1669 users visited in the last hour