Annotating gene ID and genomic features from bed files
Entering edit mode
2.4 years ago

Hello, I have a bed file from a CLIP-seq experiment that has the columns: chr, start, stop, -log10(p-value), log2fold enrichment and strand (The bed file was generated with bedtools intersect of three bed files). I wish to annotate the intervals with gene name and genomic features (3'UTR, exon, intron, etc). I have found several ways to just annotate the gene name but I would really like to include the genomic features if possible. Any recommendations? Thanks!

genome CLIP • 854 views
Entering edit mode
2.4 years ago

Using BEDOPS convert2bed, you can make a file of named features:

$ wget -qO- \
    | gunzip --stdout - \
    | convert2bed -i gff - \
    | awk -vFS="\t" -vOFS="\t" '{ print $1, $2, $3, $4"|"$8 }' - \
    > named-features.bed

Adjust the GENCODE download URL as needed for your reference genome and desired release version.

Then map your sorted BED file to the named features (assuming your BED file is called clip-seq.bed):

$ bedmap --echo --echo-map-id --delim '\t' <(sort-bed clip-seq.bed) named-features.bed > answer.bed

The name and feature category will be in the last column of answer.bed.

If you need to translate Ensembl IDs to HGNC names, one option is to modify the awk statement above to parse out the HGNC name from the optional attributes in the GFF file. As they may not always be present, another option is the use of the mygene service. An example Python script is available here that demonstrates retrieval for mm10 records: A: Get gene names from ensembl ID or gene region


Login before adding your answer.

Traffic: 2315 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6