Easy Way To Get 3' Utr Lengths Of A List Of Genes
10.3 years ago
Paul ▴ 760

Hi, as the title says, I'm wondering if there is any tool available that would let me drop in a list of, say, Entrez gene IDs and get their corresponding 3' UTR lengths?

Thanks for any suggestions.

utr • 21k views
10.3 years ago

As an alternative web-based solution, try the Ensembl BioMart. The following will give you 3' UTR information for each transcript of a gene, but will require a little subtraction, e.g.:

  1. Choose Database -> Ensembl Genes 60
  2. Choose Dataset -> Homo Sapiens Genes
  3. Click "Filters" on left hand menubar
  4. Expand "Gene" section by clicking "+"
  5. Select "ID list limit" check box.
  6. Select Entrez gene IDs from ID list limit drop down menu
  7. Paste in list of Entrez gene IDs
  8. Click "Attributes" on left hand menubar
  9. Click "Sequences" radio button
  10. Expand "Sequences" section by clicking "+"
  11. Check "3' UTR" under "sequences" header
  12. Expand "Header information" section by clicking "+"
  13. Check "3' UTR start" and "3' UTR end" and "Transcript name" under "Transcript information" header
  14. Click "Results" button at top left.

This will give you a set of FASTA records of 3' UTRs for all transcripts of your set of Entrez gene IDs, with headers that contain the start and end of each 3' UTR in genomic coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.
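As a rough sketch of how the exported FASTA could be turned into lengths without doing the subtraction by hand, the Python below counts the bases in each record. It assumes, without confirmation from this thread, that records with no annotated UTR carry a "Sequence unavailable" placeholder instead of bases, and the header layout in the usage example is purely hypothetical. Counting bases measures the sequence as exported, so whether introns are excluded depends on what BioMart returns.

```python
def utr_lengths_from_fasta(fasta_text):
    """Map each FASTA header to the number of bases in its record."""
    records = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)

    lengths = {}
    for name, parts in records.items():
        seq = "".join(parts)
        # Assumption: records without a UTR contain a text placeholder, not bases.
        if seq.lower().startswith("sequence unavailable"):
            lengths[name] = 0
        else:
            lengths[name] = len(seq)
    return lengths


# Hypothetical export: header fields and IDs are made up for illustration.
example = """>GENE1-001|1000|1010
ACGTACGTAC
A
>GENE1-002||
Sequence unavailable
"""
lengths = utr_lengths_from_fasta(example)
for name, n in lengths.items():
    print(name, n)
```

Keeping the full header as the key preserves the transcript name, so alternative 3' UTRs of the same gene stay separate.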


Paul, you can actually automate the query which Casey described by first doing it manually, then you click on 'Perl' in the upper right-hand corner of the interface and you will get a Perl program that retrieves the same query-results programmatically. You can then tweak the Perl program to your needs.


Or, to save the clicking, you could hack up a little Perl script using the EnsEMBL Perl API ;-)


Regarding the 3' UTR, why can't I find the 3' UTR sequence within the cDNA transcript sequence? Searching for the 5' UTR works without a problem.

4.8 years ago
marcosmorgan ▴ 110

This can be done using the GenomicFeatures library from Bioconductor (and dplyr)

I will use the refSeq transcripts ("refGene") from mouse ("mm10")


library(GenomicFeatures)   # makeTxDbFromUCSC(), threeUTRsByTranscript()
library(dplyr)             # %>%, group_by(), summarise()

refSeq             <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")
threeUTRs          <- threeUTRsByTranscript(refSeq, use.names=TRUE)
# A 3' UTR can span several exons: take each piece's width, then sum per transcript
length_threeUTRs   <- width(ranges(threeUTRs))
the_lengths        <- as.data.frame(length_threeUTRs)
the_lengths        <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths        <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")

The dataframe "the_lengths" has what you need.


This was great, thanks. I needed to plot a histogram of 3' UTR lengths for a whole genome and this was a neat and accurate solution. I tried lots of other ways and none of them were successful. As an additional note, I discovered that "refGene" is no longer a supported table, so I used the supportedUCSCtables() function to determine that "knownGene" is the currently supported table.

> supportedUCSCtables(genome = "mm10", url="http://genome.ucsc.edu/cgi-bin/")
             tablename          track          subtrack
1            knownGene     UCSC Genes              <NA>


I also noted that the gene with the longest 3' UTR is an outlier and may actually be erroneous: it is a noncoding gene, but for some reason there is a short coding track in its UCSC annotation, which results in it being annotated as having a long 3' UTR. The upper limit for 3' UTR lengths seems to be about 16k.

> the_lengths<- the_lengths[order(the_lengths$`3' UTR Length`,decreasing=TRUE),]
> head(the_lengths)
# A tibble: 6 x 2
  `RefSeq Transcript` `3' UTR Length`
                <chr>           <int>
1          uc012fxx.1           82649
2          uc009css.1           15929
3          uc009cst.1           15929
4          uc008drr.3           15586
5          uc008drs.2           15586
6          uc008dfj.2           14319
10.3 years ago

The table knownGene in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns and exons, cdsStart/cdsEnd, txStart/txEnd).

The table kgXref contains the NCBI ID and is linked to knownGene.

For genes on the '+' strand the query would be as follows. For speed, I don't take into account any splicing between the last codon and the end of the transcript; that would need more code than a simple SQL query:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18
mysql> select distinct X.geneSymbol, K.txEnd-K.cdsEnd
 from kgXref as X,
 knownGene as K
 where K.name=X.kgId and
   K.strand="+" ;

| geneSymbol    | K.txEnd-K.cdsEnd |
| BC032353      |             3006 |
| AX748260      |             3157 |
| BC048429      |             1540 |
| OR4F5         |                0 |
| OR4F5         |                1 |
| DQ599874      |               31 |
| DQ599768      |               78 |
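The query above covers only the '+' strand and ignores introns downstream of the CDS. A minimal Python sketch of the extra logic is given below, assuming rows are fetched with the knownGene columns strand, cdsStart, cdsEnd, exonStarts and exonEnds, and that coordinates are 0-based, half-open with comma-separated exon lists (the usual UCSC convention). The function name and the example coordinates are made up for illustration, and treating cdsStart == cdsEnd as "noncoding" is an assumption.

```python
def parse_coords(field):
    """Turn a UCSC-style list such as '100,300,500,' into [100, 300, 500]."""
    return [int(x) for x in field.split(",") if x]


def three_prime_utr_length(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Sum the exonic bases that lie 3' of the CDS, on either strand.

    Returns None when cds_start == cds_end (assumed to mean no annotated CDS).
    """
    if cds_start == cds_end:
        return None
    total = 0
    for start, end in zip(parse_coords(exon_starts), parse_coords(exon_ends)):
        if strand == "+":
            # 3' UTR is the exonic sequence at or after cdsEnd.
            total += max(0, end - max(start, cds_end))
        else:
            # On '-', the 3' UTR is the exonic sequence before cdsStart.
            total += max(0, min(end, cds_start) - start)
    return total


# Hypothetical transcript: three exons, CDS spanning 150..350.
plus_len = three_prime_utr_length("+", 150, 350, "100,300,500,", "200,400,700,")
minus_len = three_prime_utr_length("-", 150, 350, "100,300,500,", "200,400,700,")
print(plus_len, minus_len)  # 250 50
```

For the '+' example, txEnd - cdsEnd would give 700 - 350 = 350, whereas the exon-aware sum is 250 because the 100 bp intron between the last two exons is excluded.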
10.3 years ago

Hmmm, alternative splicing of 3'-UTRs is not typical, but it can happen, and there certainly are lots of examples of alternate terminal exons. So I would not want to link a gene symbol directly to a 3'-UTR length, but rather gene symbol to mRNA identifier to that mRNA's 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test case for one gene with two mRNA isoforms and hence two possibly different 3'-UTR lengths.

Just something to consider...
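To make the gene -> mRNA -> 3'-UTR length mapping concrete, here is a small Python sketch that keeps one length per transcript and flags genes whose isoforms disagree. The transcript identifiers are hypothetical, GENE2 is a made-up gene, and the OR4F5 lengths are just the two values from the table above.

```python
from collections import defaultdict


def utr_lengths_by_gene(rows):
    """Build {gene: {transcript: utr_length}} from (gene, transcript, length) rows."""
    by_gene = defaultdict(dict)
    for gene, transcript, length in rows:
        by_gene[gene][transcript] = length
    return dict(by_gene)


def genes_with_multiple_utr_lengths(by_gene):
    """Return genes whose transcripts have more than one distinct 3'-UTR length."""
    return sorted(
        gene for gene, per_tx in by_gene.items()
        if len(set(per_tx.values())) > 1
    )


# Hypothetical transcript IDs; only the lengths 0 and 1 come from the table above.
rows = [
    ("OR4F5", "tx_a", 0),
    ("OR4F5", "tx_b", 1),
    ("GENE2", "tx_c", 31),
]
by_gene = utr_lengths_by_gene(rows)
multi = genes_with_multiple_utr_lengths(by_gene)
print(multi)  # ['OR4F5']
```

Reporting per transcript and summarising per gene only afterwards (minimum, maximum, or longest isoform) keeps the choice of summary explicit.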


The gene I worked on in grad school had alternate 3'-UTRs (http://www.ncbi.nlm.nih.gov/pubmed/1487151); they differed by over 700 bp. That said, another caveat of pulling this from databases is that, as far as I know, UCSC doesn't make the call on the presence of a UTR itself; it relies on the GenBank record. Absence of a UTR in the annotation does not necessarily mean there isn't one.

