Question: Retrieve genomic physical coordinates of 3'UTR for set of genes
Mr Locuace100 wrote:

Hello, I have a list of human genes and I'd like to retrieve the physical coordinates (GRCh37/hg19 assembly) of their 3'UTRs. Are you aware of any software that can do that? Thanks !

modified 5 months ago by vkkodali • written 5 months ago by Mr Locuace

This answer assumes that you already have a .gff file for the genome of interest:

I would recommend the gffutils package in Python. I know you said "software", but this package is well documented for your need, and you would basically just need to write 5-10 lines of code (which you can also find in the documentation) to retrieve the start and stop positions on the basis of gene IDs.

As an example, here is the code that I wrote:

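# import the package
import gffutils

# path to your GFF file (placeholder name; replace with your own file)
gff_file_path = "annotation.gff"

# create a "local database" from your gff file
db = gffutils.create_db(gff_file_path, dbfn="local_db_1.db", keep_order=True,
                        force=True, sort_attribute_values=True,
                        merge_strategy="merge")

# access every gene by its id like this ("gene_id" stands for a real ID from your file)
gene = db["gene_id"]

# access the gene's start and stop position like this
print(gene.start, gene.end)

# for accessing UTRs that fall in the region of this gene
for item in db.region(gene, featuretype="three_prime_UTR"):
    print(item.seqid, item.start, item.end, item.strand)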

You can make a list of your IDs and then use a for loop to access the start and stop positions of every gene and of its UTRs. Good luck!
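
The loop over a list of IDs could look roughly like the sketch below. It is not the code from the answer above: the helper name three_prime_utr_rows is made up for illustration, and it uses db.children instead of db.region, so only features annotated as descendants of the gene are returned (a region query also returns UTRs of any other gene that overlaps the interval). It assumes that your GFF labels these features three_prime_UTR and links them to the gene through Parent attributes.

```python
def three_prime_utr_rows(db, gene_ids, featuretype="three_prime_UTR"):
    """Collect (gene_id, seqid, start, end, strand) for each UTR of each gene.

    `db` is expected to be a gffutils.FeatureDB. The helper name and the
    default feature type label are assumptions, not part of gffutils.
    """
    rows = []
    for gene_id in gene_ids:
        gene = db[gene_id]  # raises if the ID is not in the database
        # children() follows the Parent hierarchy (gene -> transcript -> UTR)
        for utr in db.children(gene, featuretype=featuretype, order_by="start"):
            rows.append((gene_id, utr.seqid, utr.start, utr.end, utr.strand))
    return rows


# Usage, assuming the database file created earlier and placeholder IDs:
#   import gffutils
#   db = gffutils.FeatureDB("local_db_1.db")
#   for row in three_prime_utr_rows(db, ["gene_id_1", "gene_id_2"]):
#       print(*row, sep="\t")
```

Coordinates come back exactly as stored in the GFF, that is 1-based and inclusive.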

written 5 months ago by manaswwm
vkkodali2.0k wrote:

For RefSeq annotation, you can use the add_utrs_to_gff Python script to first add 5' and 3' UTR features, and then use Unix grep to extract the genes of interest. The latest RefSeq annotation of the GRCh37 assembly is here:

## download annotation in GFF3 format
$ curl -O

## download the add_utrs_to_gff python script
$ curl -O

## add utr features to the gff3 file 
$ python3 add_utrs_to_gff.py GCF_000001405.25_GRCh37.p13_genomic.gff.gz > GRCh37_with_utrs.gff3

## extract 5' UTR for GeneID:5768 
$ grep 'five_prime_UTR' GRCh37_with_utrs.gff3 | grep -w 'GeneID:5768'
NC_000001.10    BestRefSeq      five_prime_UTR  180123968       180124042       .       +       .       ID=utr00100412821;Parent=rna-NM_001004128.2;transcript_id=NM_001004128.2;Dbxref=GeneID:5768,Genbank:NM_001004128.2,HGNC:HGNC:9756,MIM:603120
NC_000001.10    BestRefSeq      five_prime_UTR  180124004       180124042       .       +       .       ID=utr00282651;Parent=rna-NM_002826.5;transcript_id=NM_002826.5;Dbxref=GeneID:5768,Genbank:NM_002826.5,HGNC:HGNC:9756,MIM:603120
written 5 months ago by vkkodali

Thanks very much @vkkodali! But how can I do this for a large list of GeneIDs?

written 5 months ago by Mr Locuace

You can use grep -f as shown below:

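## make a list of all genes you are interested in, one per line, written as in the annotation (e.g. GeneID:5768) so that grep -w does not match a bare number elsewhere on the line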
$ cat genes.txt 

## extract 5' UTRs
$ grep 'five_prime_UTR' GRCh37_with_utrs.gff3 | grep -w -f genes.txt
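
Since the original question asks for 3' UTRs, the same filtering can also be sketched in Python, matching the GeneID in the Dbxref attribute exactly instead of relying on word boundaries. This is only an illustration under stated assumptions: the function name utr_lines_for_genes is hypothetical, genes.txt is assumed to hold one entry per line written either as 5768 or as GeneID:5768, and the label three_prime_UTR is assumed to be what the script writes for 3' UTRs, by analogy with the five_prime_UTR lines shown earlier.

```python
import re

# Captures the numeric part of a GeneID cross-reference, e.g. "GeneID:5768"
GENE_ID_RE = re.compile(r"GeneID:(\d+)")


def utr_lines_for_genes(gff_lines, gene_ids, featuretype="three_prime_UTR"):
    """Yield GFF3 lines of the given feature type whose Dbxref names a wanted GeneID."""
    wanted = {str(g).strip().replace("GeneID:", "") for g in gene_ids if str(g).strip()}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != featuretype:
            continue
        match = GENE_ID_RE.search(cols[8])
        if match and match.group(1) in wanted:
            yield line.rstrip("\n")


# Usage, assuming the file names from the commands above:
#   with open("genes.txt") as ids, open("GRCh37_with_utrs.gff3") as gff:
#       for hit in utr_lines_for_genes(gff, ids.read().split()):
#           print(hit)
```

Passing featuretype="five_prime_UTR" gives the 5' UTR lines instead.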
written 5 months ago by vkkodali

Great, thanks @vkkodali !

written 5 months ago by Mr Locuace
ATpoint35k wrote:

Download an annotation file for hg19, e.g. from GENCODE, then extract UTRs:

See the thread "Extracting 5'UTR and 3'UTR bed files from gtf file".

Then subset for the genes you are interested in.
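
The subsetting step might be sketched as follows. It assumes a GENCODE-style GTF in which UTR records carry the feature type UTR and a gene_name attribute; both are assumptions about your file, so pass a different label if your annotation differs. The function name gtf_utrs_to_bed is hypothetical. This sketch does not separate 5' from 3' UTRs, which still requires comparing each UTR with the coding region of its transcript. It only converts GTF coordinates (1-based, inclusive) to BED coordinates (0-based, half-open) by subtracting 1 from the start.

```python
import re

# Pulls the gene symbol out of a GTF attribute column, e.g. gene_name "PTPRC";
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_utrs_to_bed(gtf_lines, gene_names, featuretype="UTR"):
    """Return BED6 rows (chrom, start, end, name, score, strand) for wanted genes."""
    wanted = set(gene_names)
    bed = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != featuretype:
            continue
        match = GENE_NAME_RE.search(cols[8])
        if match and match.group(1) in wanted:
            # GTF start is 1-based inclusive; BED start is 0-based
            bed.append((cols[0], int(cols[3]) - 1, int(cols[4]),
                        match.group(1), "0", cols[6]))
    return bed


# Usage, with a placeholder file name and placeholder gene symbols:
#   with open("annotation.gtf") as gtf:
#       for row in gtf_utrs_to_bed(gtf, ["GENE_A", "GENE_B"]):
#           print(*row, sep="\t")
```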

written 5 months ago by ATpoint