Get one maximum UTR per gene
Entering edit mode
3.8 years ago
max • 0

I have a list of about 3,000 genes and I want to retrieve the genomic locus of each gene's 3' UTR. My goal is to screen all the sequence that could be part of a gene's 3' UTR, even if some of that sequence isn't always in the 3' UTR of every transcript.

I've tried the methods in a few previous answers and, for about 80% of the genes in my list, I can find the predicted start and end of the 3' UTR using Biomart or the UCSC table browser. The problem is that for each gene I get multiple results (for all the alternative transcripts of that gene), and in each result the 3' UTR starts and ends at a different place. What I would like is the site of the most upstream 3' UTR start and the most downstream 3' UTR end that have been predicted for a given gene.

Does anyone know a straightforward way to get these from UCSC or Biomart? Can I perhaps get the shortest predicted CDSend and longest predicted transcription end?

Thanks for your help!

3' UTR UTR UCSC Biomart • 1.5k views
Entering edit mode

Can you post a snippet (head) of your genes and UTR datasets? I have a solution in mind, but it would depend on the format of your datasets. Feel free to let us know if you'd like a hand. Cheers.

Entering edit mode

Thanks for your help!

Here's a snippet of the output from Biomart.

I select dataset>'Human genes (GRCH37.p13)', Filters>'Gene Names' (here I paste a list of gene names, but I have refseq mRNA IDs too), Attributes>'Gene stable ID, Transcript stable ID, Gene name, 3' UTR start, 3' UTR end', Results>'unique results only'

Gene stable ID  Transcript stable ID    Gene name   3' UTR start    3' UTR end
ENSG00000115760 ENST00000421745 BIRC6       
ENSG00000115760 ENST00000421745 BIRC6   32842972    32843966
ENSG00000115760 ENST00000466527 BIRC6       
ENSG00000115760 ENST00000444173 BIRC6       
ENSG00000115760 ENST00000431454 BIRC6       
ENSG00000115760 ENST00000431454 BIRC6   32667175    32667294
ENSG00000115760 ENST00000431454 BIRC6   32667392    32667448
ENSG00000115760 ENST00000483194 BIRC6       
ENSG00000115760 ENST00000462504 BIRC6       
ENSG00000115760 ENST00000497023 BIRC6       
ENSG00000115760 ENST00000496555 BIRC6       
ENSG00000115760 ENST00000461788 BIRC6       
ENSG00000115760 ENST00000471232 BIRC6       
ENSG00000115760 ENST00000470250 BIRC6       
ENSG00000115760 ENST00000465130 BIRC6       
ENSG00000124171 ENST00000371610 PARD6B      
ENSG00000124171 ENST00000371610 PARD6B  49367026    49370278
ENSG00000124171 ENST00000396039 PARD6B      
ENSG00000124171 ENST00000396039 PARD6B  49373281    49373332
ENSG00000102189 ENST00000322349 EEA1        
ENSG00000102189 ENST00000322349 EEA1    93164413    93169786
ENSG00000102189 ENST00000418984 EEA1        
ENSG00000102189 ENST00000418984 EEA1    93251207    93251255
ENSG00000102189 ENST00000418984 EEA1    93251051    93251116
ENSG00000102189 ENST00000418984 EEA1    93247691    93247730
ENSG00000102189 ENST00000418984 EEA1    93246688    93246801
ENSG00000102189 ENST00000418984 EEA1    93245951    93246072
ENSG00000102189 ENST00000418984 EEA1    93244887    93245042
Entering edit mode
3.8 years ago

What I would like is the site of the most upstream 3' UTR start and the most downstream 3' UTR end that have been predicted for a given gene.

Given the format of your posted snippet of output, you could use the following Python script to get a JSON object of gene IDs and most-upstream and most-downstream start and stop positions, respectively:

#!/usr/bin/env python

import sys
import json

def main():
    c = 0
    d = {}
    for l in sys.stdin:
        l = l.strip()
        if c > 0:
                (gsi, tsi, gn, start, end) = l.split('\t')
                if gsi not in d:
                    d[gsi] = {}
                    d[gsi]['hgnc'] = gn
                if 'start' not in d[gsi] or d[gsi]['start'] < start:
                    d[gsi]['start'] = start
                if 'end' not in d[gsi] or d[gsi]['end'] > end:
                    d[gsi]['end'] = end
            except ValueError as ve:
        c += 1
    sys.stdout.write("%s\n" % json.dumps(d, indent=2))

if __name__ == "__main__":

On your snippet:

$ < snippet.txt
  "ENSG00000115760": {
    "start": "32842972",
    "end": "32667294",
    "hgnc": "BIRC6"
  "ENSG00000102189": {
    "start": "93251207",
    "end": "93169786",
    "hgnc": "EEA1"
  "ENSG00000124171": {
    "start": "49373281",
    "end": "49370278",
    "hgnc": "PARD6B"

You could write the output to a file like so:

$ < snippet.txt > positions.json

You can modify the format of this output by changing the line:

sys.stdout.write("%s\n" % json.dumps(d, indent=2))

to something else.

Entering edit mode

Hi Alex,

This is brilliant - thank you very much indeed. I was going to try something similar but yours is much cleaner. I'm new to JSON but it seems a really convenient way to store the data.



Entering edit mode

I'm working on this again and have noticed a problem - some of the UTRs are >10,000bp long, which seems very unlikely. MEIS1, for example, has a UTR 130,000bp long. Is this an annotation problem? Is there a way to work around this?

Thanks again!

Entering edit mode
3.8 years ago

You could try BEDTools 'merge' to collapse overlapping UTRs into one, but I anticipate a few problems: 1) if any of the UTRs do not overlap, you'll still be left with multiples; 2) UTRs from different genes might overlap, and those would collapse as well.


Login before adding your answer.

Traffic: 2459 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6