Is There A Easy Way To Find The Gene Annotations For Features In Bed File?
4
4
Entering edit mode
11.4 years ago
k.nirmalraman ★ 1.1k

Dear All,

Is there a any easy method to find the gene annotations that overlap in the features of a bed file?

For instance:

Chr1 2292052 2292086 AE016830.1 NA -

Chr1 2771733 2771767 AE016830.1 NA -

Chr1 254081 254115 AE016830.1 NA +

Chr1 1019821 1019855 AE016830.1 NA +

I would like to know if there are any genes/annotations that overlap in these coordinates? I would also like to extend the coordinates on either sides during such a search.

Thanks in advance!

gene annotation bed • 19k views
ADD COMMENT
12
Entering edit mode
11.4 years ago
Irsan ★ 7.8k

Download annotations from UCSC table browser use table "Genes and gene prediction tracks" and export the data as BED-file. Then use bedtools or bedops to find intersections between the downloaded bed-file from UCSC with the genes and your features bed file.

Also see:

ADD COMMENT
0
Entering edit mode

wow, great list of answers!

ADD REPLY
5
Entering edit mode
11.4 years ago

You can use the bedmap tool in the BEDOPS suite to quickly and efficiently generate an answer.

Let's say you have your features in a file called Features.bed and your genes in a file called Genes.bed. Also, the ID column of Genes.bed contains the name of the gene, following UCSC BED format conventions.

To first extend the coordinates of Features.bed by, say, 100 bases, use the --range operator with bedmap. Input that looks like this:

Chr1 2292052 2292086 AE016830.1 NA -
Chr1 2771733 2771767 AE016830.1 NA -
...

will look like this, when operated on (i.e., when searching for overlapping elements from Genes.bed) with a range adjustment of --range 100:

Chr1 2291952 2292186 AE016830.1 NA -
Chr1 2771633 2771867 AE016830.1 NA -
...

If you want to specify a non-symmetric padding, use --range L:R, replacing L and R with desired "left" and "right" integer values. You can use negative values to "shrink" or shift elements, as well. See the documentation for more details.

To find the genes contained within these padded ranges, use the following bedmap command to redirect output to a file called Answer.bed:

$ bedmap --echo --echo-map-id --range 100 Features.bed Genes.bed > Answer.bed

Output will look something like:

Chr1 2292052 2292086 AE016830.1 NA -|gene-4204;gene-19383
Chr1 2771733 2771767 AE016830.1 NA -|gene-20043;gene-20199
...

The pipe character (|) delimits unpadded features from genes which overlap padded features. The default overlap criterion between padded feature and gene is one or more bases. This overlap can be adjusted; see the documentation for more details.

Note: If you need the feature output to reflect padding, use bedops --range on the features and then pipe the result to bedmap, e.g.,:

$ bedops --range 100 --everything Features.bed | bedmap --echo --echo-map-id - Genes.bed > Answer.bed

Output in this case would show adjusted reference coordinates:

Chr1 2291952 2292186 AE016830.1 NA -|gene-4204;gene-19383
Chr1 2771633 2771867 AE016830.1 NA -|gene-20043;gene-20199
...

Please see the documentation for an explanation of the --range operator with bedops.

ADD COMMENT
0
Entering edit mode

Thanks much for a detailed answer!! That saves me!!

ADD REPLY
0
Entering edit mode

My feature file chr1 856634 856894 chr1 3540069 3540450 chr1 5466493 5466794 chr1 5514228 5514422 chr1 5771065 5771378 chr1 6370110 6370371 chr1 6801896 6802191 chr1 6996427 6996724 chr1 7117580 7117812 chr1 7127607 7127920

My annotation bed file

chr1    11868   14409   DDX11L1
chr1    14403   29570   WASH7P
chr1    17368   17436   MIR6859-2
chr1    29553   31109   MIR1302-10
chr1    30365   30503   MIR1302-11
chr1    34553   36081   FAM138A
chr1    52472   53312   OR4G4P
chr1    62947   63887   OR4G11P
chr1    69090   70008   OR4F5
chr1    89294   133723  RP11-34P13.7

My output looks like this

chr1    856634  856894|
chr1    3540069 3540450|
chr1    5466493 5466794|
chr1    5514228 5514422|
chr1    5771065 5771378|
chr1    6370110 6370371|
chr1    6801896 6802191|
chr1    6996427 6996724|
chr1    7117580 7117812|
chr1    7127607 7127920|

My command

bedmap --echo --echo-map-id --range 100 Normal_sorted.bed gene_annotation.bed > Answer.bed

what is going wrong here

ADD REPLY
1
Entering edit mode

None of the map elements (annotations) in your example overlap with the reference elements (features) in your example. That's why you get no overlaps printed out for each reference element in the output. Change your map and/or reference element sets appropriately.

ADD REPLY
0
Entering edit mode

okay i took out ATAC seq counts using multibamsummary which i was trying to map into the annotation looks like this won't be the way to map these

ADD REPLY
0
Entering edit mode

Maybe just use a more complete set of annotations, or loosen overlap rules? If there are other criteria for overlap, those criteria could be added to a bedmap call.

ADD REPLY
0
Entering edit mode

i made that annotaion file from the gtf file i used for alignment ,"complete set of annotations," you saying take everything ?

ADD REPLY
0
Entering edit mode

Unless your annotations are complete, I'd start with a known set of Gencode or Ensembl annotations. Or perhaps review how you are aligning reads to your annotations.

ADD REPLY
0
Entering edit mode

it is gencode annotation i aligned my atac seq reads to it

ADD REPLY
2
Entering edit mode
11.4 years ago

bedtools will be your friend.

ADD COMMENT
1
Entering edit mode
7.7 years ago
Shicheng Guo ★ 9.4k

how about bedtools closest

bedtools closest -fu -D "ref" -a input.bed -b Gene.bed | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$9}' | sort -u > anno.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6