Question

Annotating gene ID and genomic features from bed files

0

Entering edit mode

5.5 years ago

nornorbinks • 0

Hello, I have a bed file from a CLIP-seq experiment that has the columns: chr, start, stop, -log10(p-value), log2fold enrichment and strand (The bed file was generated with bedtools intersect of three bed files). I wish to annotate the intervals with gene name and genomic features (3'UTR, exon, intron, etc). I have found several ways to just annotate the gene name but I would really like to include the genomic features if possible. Any recommendations? Thanks!

genome CLIP • 1.8k views

ADD COMMENT • link updated 5.5 years ago by Alex Reynolds 36k • written 5.5 years ago by nornorbinks • 0

score 3 · Accepted Answer · 2020-01-09

Using BEDOPS convert2bed, you can make a file of named features:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
    | gunzip --stdout - \
    | convert2bed -i gff - \
    | awk -vFS="\t" -vOFS="\t" '{ print $1, $2, $3, $4"|"$8 }' - \
    > named-features.bed

Adjust the GENCODE download URL as needed for your reference genome and desired release version.

Then map your sorted BED file to the named features (assuming your BED file is called clip-seq.bed):

$ bedmap --echo --echo-map-id --delim '\t' <(sort-bed clip-seq.bed) named-features.bed > answer.bed

The name and feature category will be in the last column of answer.bed.

If you need to translate Ensembl IDs to HGNC names, one option is to modify the awk statement above to parse out the HGNC name from the optional attributes in the GFF file. As they may not always be present, another option is the use of the mygene service. An example Python script is available here that demonstrates retrieval for mm10 records: A: Get gene names from ensembl ID or gene region