Question: Easy Way To Get 3' UTR Lengths Of A List Of Genes
9.7 years ago by
United States
Paul760 wrote:

Hi, as the title says, I'm wondering whether there is a tool that would let me drop in a list of, say, Entrez gene IDs and get their corresponding 3' UTR lengths?

Thanks for any suggestions.

utr • 20k views
ADD COMMENTlink modified 4.2 years ago by marcosmorgan110 • written 9.7 years ago by Paul760
9.7 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

As an alternative web-based solution, try the Ensembl BioMart. The following steps give you 3' UTR information for each transcript of a gene, though you will need a little subtraction to turn the coordinates into lengths, e.g.

  1. Choose Database -> Ensembl Genes 60
  2. Choose Dataset -> Homo Sapiens Genes
  3. Click "Filters" on left hand menubar
  4. Expand "Gene" section by clicking "+"
  5. Select "ID list limit" check box.
  6. Select Entrez gene IDs from ID list limit drop down menu
  7. Paste in list of Entrez gene IDs
  8. Click "Attributes" on left hand menubar
  9. Click "Sequences" radio button
  10. Expand "Sequences" section by clicking "+"
  11. Check "3' UTR" under "sequences" header
  12. Expand "Header information" section by clicking "+"
  13. Check "3' UTR start" and "3' UTR end" and "Transcript name" under "Transcript information" header
  14. Click "Results" button at top left.

This will give you a set of FASTA records of 3' UTRs for all transcripts of your Entrez gene IDs, with headers that contain the start and end of each 3' UTR in genome coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.
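One way to sidestep both the subtraction and the intron concern is to count the bases in each exported record rather than subtracting coordinates. The sketch below assumes the exported sequence is the spliced 3' UTR and that records without a UTR carry the placeholder text "Sequence unavailable". The function name and the example headers are hypothetical, not part of BioMart.

```python
def utr_lengths_from_fasta(text):
    """Return {header: number of bases} for FASTA-formatted text.

    Assumption: records with no 3' UTR contain the literal line
    "Sequence unavailable", which is counted as length 0.
    """
    lengths = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Start a new record; keep the full header as the key
            header = line[1:]
            lengths[header] = 0
        elif header is not None and line != "Sequence unavailable":
            # Sequence lines may be wrapped, so add up every line
            lengths[header] += len(line)
    return lengths


# Hypothetical two-record export (headers are made up for illustration)
example = ">tx1|100|250\nACGTACGT\nACG\n>tx2|300|400\nSequence unavailable\n"
fasta_lengths = utr_lengths_from_fasta(example)
```

Because the header is kept as the dictionary key, each length stays tied to its transcript rather than being collapsed per gene.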

ADD COMMENTlink written 9.7 years ago by Casey Bergman18k

Paul, you can actually automate the query Casey described: run it manually first, then click 'Perl' in the upper right-hand corner of the interface to get a Perl program that retrieves the same results programmatically. You can then tweak the Perl program to your needs.

ADD REPLYlink written 9.7 years ago by Joachim2.9k

Or, to save the clicking, you could hack up a little Perl script using the EnsEMBL Perl API ;-)

ADD REPLYlink written 9.4 years ago by Steve Moss2.3k

Regarding the 3' UTR: why can't I find the 3' UTR sequence within the cDNA transcript sequence? Searching for the 5' UTR works without a problem.

ADD REPLYlink written 9.1 years ago by Jirapong20
4.2 years ago by
MRC centre for regenerative medicine, University of Edinburgh
marcosmorgan110 wrote:

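This can be done using the GenomicFeatures package from Bioconductor (and dplyr).

I will use the RefSeq transcripts ("refGene") from mouse ("mm10"):


library(GenomicFeatures)
library(dplyr)
# Build a transcript database from the UCSC refGene table
refSeq             <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")
# 3' UTR exon ranges, grouped and named by transcript
threeUTRs          <- threeUTRsByTranscript(refSeq, use.names=TRUE)
# Width of each 3' UTR exon piece
length_threeUTRs   <- width(ranges(threeUTRs))
# Right-hand side was missing in the post; as.data.frame() is assumed (it yields group, group_name, value)
the_lengths        <- as.data.frame(length_threeUTRs)
# Sum the pieces per transcript
the_lengths        <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths        <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")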

The dataframe "the_lengths" has what you need.
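The step that matters in the code above is the group-and-sum: a 3' UTR split across several exons contributes several widths, and these are added up per transcript. A minimal sketch of that idea in plain Python, using made-up transcript IDs and widths (the function name is hypothetical):

```python
from collections import defaultdict


def sum_utr_pieces(pieces):
    """Sum 3' UTR exon-piece widths per transcript.

    `pieces` is an iterable of (transcript_id, width) pairs, one pair
    per UTR exon piece.
    """
    totals = defaultdict(int)
    for transcript_id, width in pieces:
        totals[transcript_id] += width
    return dict(totals)


# Made-up data: tx_A has a 3' UTR split over two exons
pieces = [("tx_A", 120), ("tx_A", 480), ("tx_B", 950)]
utr_totals = sum_utr_pieces(pieces)
```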

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by marcosmorgan110

This was great, thanks. I needed to plot a histogram of 3' UTR lengths for a whole genome, and this was a neat and accurate solution. I tried lots of other ways and none of them were successful. As an additional note, I discovered that "refGene" is no longer a supported table, so I used the supportedUCSCtables() function to determine that "knownGene" is the currently supported table.

> supportedUCSCtables(genome = "mm10", url="")
             tablename          track          subtrack
1            knownGene     UCSC Genes              <NA>


I also noted that the gene with the longest 3' UTR is an outlier and may actually be erroneous: it is a noncoding gene, but for some reason there is a short coding track in its UCSC annotation, which results in it being annotated as having a long 3' UTR. The upper limit for 3' UTR lengths seems to be about 16k.

> the_lengths<- the_lengths[order(the_lengths$`3' UTR Length`,decreasing=TRUE),]
> head(the_lengths)
# A tibble: 6 x 2
  `RefSeq Transcript` `3' UTR Length`
                <chr>           <int>
1          uc012fxx.1           82649
2          uc009css.1           15929
3          uc009cst.1           15929
4          uc008drr.3           15586
5          uc008drs.2           15586
6          uc008dfj.2           14319
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by vperreau0
9.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

The table knownGene in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns and exons, cdsStart/cdsEnd, txStart/txEnd).

The table kgXref contains the NCBI ID and is linked to knownGene.

For the genes on the '+' strand, the query would be as follows. For speed, I won't take into account any splicing between the last codon and the end of transcription; that would need more code than a simple SQL query:

mysql  -h -A -u genome -D hg18
mysql>select distinct X.geneSymbol, K.txEnd-K.cdsEnd
 kgXref as X,
 knownGene as K
   and and
   K.strand="+" ;

| geneSymbol    | K.txEnd-K.cdsEnd |
| BC032353      |             3006 |
| AX748260      |             3157 |
| BC048429      |             1540 |
| OR4F5         |                0 |
| OR4F5         |                1 |
| DQ599874      |               31 |
| DQ599768      |               78 |
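The splicing-aware calculation that the plain txEnd-cdsEnd difference skips can be sketched in a few lines. This assumes knownGene-style fields: 0-based, half-open coordinates, exonStarts and exonEnds as comma-separated strings, and cdsStart equal to cdsEnd for noncoding transcripts. The function name is hypothetical.

```python
def three_prime_utr_length(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Spliced 3' UTR length from knownGene-style fields.

    Only exonic bases downstream of the stop codon are counted, so
    introns inside the UTR are excluded. Returns None for transcripts
    with no coding region (cds_start == cds_end).
    """
    if cds_start == cds_end:
        return None
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    total = 0
    for s, e in zip(starts, ends):
        if strand == "+":
            # 3' UTR lies after the end of the coding region
            lo, hi = max(s, cds_end), e
        else:
            # On the minus strand it lies before the coding start
            lo, hi = s, min(e, cds_start)
        total += max(0, hi - lo)
    return total


# Made-up transcript with three exons and an intron inside the 3' UTR
plus_len = three_prime_utr_length("+", 150, 350, "100,300,500,", "200,400,700,")
minus_len = three_prime_utr_length("-", 350, 650, "100,300,500,", "200,400,700,")
```

For the plus-strand example, txEnd-cdsEnd would give 350, while the exon-aware sum gives 250 because the 100 bp intron between the last two exons is left out.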
ADD COMMENTlink modified 12 months ago by RamRS30k • written 9.7 years ago by Pierre Lindenbaum130k
9.7 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

Hmmm, there is not typically alternative splicing of 3'-UTRs, but it can happen. There certainly are lots of examples of alternate terminal exons. So, I would not want to link gene symbol to 3'-UTR length, but rather gene symbol to mRNA identifier to its 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test for one gene with 2 mRNA isoforms and hence two different, or not, 3'-UTR lengths.

Just something to consider...
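A rough sketch of the gene symbol -> mRNA identifier -> 3'-UTR length layout, with made-up identifiers and lengths (the function name is hypothetical):

```python
from collections import defaultdict


def lengths_by_gene(records):
    """Nest 3'-UTR lengths as {gene_symbol: {transcript_id: length}}.

    `records` is an iterable of (gene_symbol, transcript_id, length)
    tuples; isoforms of one gene stay separate instead of being merged.
    """
    by_gene = defaultdict(dict)
    for gene, transcript, length in records:
        by_gene[gene][transcript] = length
    return {gene: dict(isoforms) for gene, isoforms in by_gene.items()}


# Made-up records: GENE1 has two isoforms with different 3'-UTR lengths
records = [("GENE1", "mrna_1", 310), ("GENE1", "mrna_2", 1040), ("GENE2", "mrna_3", 3006)]
nested = lengths_by_gene(records)
```

Keeping the transcript level means a gene with alternate terminal exons shows one length per isoform rather than a single ambiguous value.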

ADD COMMENTlink written 9.7 years ago by Larry_Parnell16k

The gene I worked on in grad school had alternate 3'-UTRs. Differed by over 700 bp. That said, another caveat of pulling this from dbs is that I know for UCSC that they don't make the call on presence of a UTR, they rely on the genbank record. Absence of a UTR does not necessarily mean there isn't one.

ADD REPLYlink written 9.7 years ago by Mary11k