Ensembl Annotation
Entering edit mode
8.0 years ago
essell22 ▴ 10

I'm looking to annotate my genomic data that is in a bed-like format ex.

> Chr1    200    4000    ...

I would like to use Ensembl to annotate my data and append the overlapping transcripts, genes, and genomic features (i.e. promoters, exons, introns... ) to my original file. ex:

> Chr1    200    4000    ...    TranscriptID    GeneID    GenomicFeature

I was wondering what would be the most efficient way to do this.

ensembl annotation python • 3.6k views
Entering edit mode
8.0 years ago

One way to do this is with the BEDOPS bedmap application and the --echo-map operator.

First, you'd need to prepare your BED-like data (let's say it is a file called myData.bed) to make it work with the chromosome naming scheme that Ensembl uses for its annotations:

$ sed -e 's/^Chr//' myData.bed > myFixedData.bed

This turns ChrN record names into N, which seems to follow Ensembl's naming convention. (You may want to use head or similar on Ensembl records to see how chromosomes are named, and change this sed statement accordingly.)

Secondly, grab your Ensembl annotations and turn them into BED files with the BEDOPS gtf2bed conversion script. Let's say you're working with human data and want human annotations. The following command grabs Ensembl r71 annotations and turns them into a sorted BED file, ready for use with all BEDOPS tools:

$ wget -O - ftp://ftp.ensembl.org//pub/mnt2/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    > Homo_sapiens.GRCh37.71.bed

Finally, we're ready to map these Ensembl annotations to our regions of interest (myFixedData.bed):

$ bedmap --echo --echo-map myFixedData.bed Homo_sapiens.GRCh37.71.bed > myAnnotatedData.bed

The file myAnnotatedData.bed contains data in the following format:

[ myRegion-1 ] | [ annotation-1-1 ] ; ... ; [ annotation-1-i ]
[ myRegion-N ] | [ annotation-N-1 ] ; ... ; [ annotation-N-j ]

In other words, each region from 1..N is printed, followed by a pipe character, and then a semi-colon-delimited list of human Ensembl annotations that overlap your region by one or more bases. Where there are no annotations, no data are printed for that region.

More information about --echo-map and overlap options are in the bedmap documentation.

If you just want certain fields from the annotations, you can post-process myAnnotatedData.bed with a (for example) Perl, Python or awk script, using the pipe, semi-colon and GTF delimiters to read through each annotation and pick out the fields of interest (e.g., transcript_id, gene_id, gene_biotype etc.).

Entering edit mode
8.0 years ago
k.nirmalraman ★ 1.1k

One option is to convert the annotation into bed format and then use the combination of bedtools and unix commands to get the desired output.

Bedtools is pretty well documented..

Entering edit mode

For example, you might try 'bedtools map' with the '-c' and '-o distinct' option.


Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6