Question: Ensembl Annotation
gravatar for essell22
7.2 years ago by
United States
essell2210 wrote:

I'm looking to annotate my genomic data that is in a bed-like format ex.

> Chr1    200    4000    ...

I would like to use Ensembl to annotate my data and append the overlapping transcripts, genes, and genomic features (i.e. promoters, exons, introns... ) to my original file. ex:

> Chr1    200    4000    ...    TranscriptID    GeneID    GenomicFeature

I was wondering what would be the most efficient way to do this.

ensembl annotation python • 3.3k views
ADD COMMENTlink modified 4.2 years ago by Biostar ♦♦ 20 • written 7.2 years ago by essell2210
gravatar for Alex Reynolds
7.2 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

One way to do this is with the BEDOPS bedmap application and the --echo-map operator.

First, you'd need to prepare your BED-like data (let's say it is a file called myData.bed) to make it work with the chromosome naming scheme that Ensembl uses for its annotations:

$ sed -e 's/^Chr//' myData.bed > myFixedData.bed

This turns ChrN record names into N, which seems to follow Ensembl's naming convention. (You may want to use head or similar on Ensembl records to see how chromosomes are named, and change this sed statement accordingly.)

Secondly, grab your Ensembl annotations and turn them into BED files with the BEDOPS gtf2bed conversion script. Let's say you're working with human data and want human annotations. The following command grabs Ensembl r71 annotations and turns them into a sorted BED file, ready for use with all BEDOPS tools:

$ wget -O - \
    | gunzip -c - \
    | gtf2bed - \
    > Homo_sapiens.GRCh37.71.bed

Finally, we're ready to map these Ensembl annotations to our regions of interest (myFixedData.bed):

$ bedmap --echo --echo-map myFixedData.bed Homo_sapiens.GRCh37.71.bed > myAnnotatedData.bed

The file myAnnotatedData.bed contains data in the following format:

[ myRegion-1 ] | [ annotation-1-1 ] ; ... ; [ annotation-1-i ]
[ myRegion-N ] | [ annotation-N-1 ] ; ... ; [ annotation-N-j ]

In other words, each region from 1..N is printed, followed by a pipe character, and then a semi-colon-delimited list of human Ensembl annotations that overlap your region by one or more bases. Where there are no annotations, no data are printed for that region.

More information about --echo-map and overlap options are in the bedmap documentation.

If you just want certain fields from the annotations, you can post-process myAnnotatedData.bed with a (for example) Perl, Python or awk script, using the pipe, semi-colon and GTF delimiters to read through each annotation and pick out the fields of interest (e.g., transcript_id, gene_id, gene_biotype etc.).

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Alex Reynolds30k
gravatar for k.nirmalraman
7.2 years ago by
k.nirmalraman1.0k wrote:

One option is to convert the annotation into bed format and then use the combination of bedtools and unix commands to get the desired output.

Bedtools is pretty well documented..

ADD COMMENTlink written 7.2 years ago by k.nirmalraman1.0k

For example, you might try 'bedtools map' with the '-c' and '-o distinct' option.

ADD REPLYlink written 7.2 years ago by Malachi Griffith18k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 804 users visited in the last hour