Question

Easy Way To Get 3' Utr Lengths Of A List Of Genes

14

Entering edit mode

13.3 years ago

Paul ▴ 760

Hi, as the title says really, I'm wondering if there is any tool available that would allow me to drop in a list of say entrez gene ids and get their corresponding 3' UTR lenghts?

Thanks for any suggestions.

utr • 26k views

ADD COMMENT • link updated 2.6 years ago by t.montserrat.ayuso • 0 • written 13.3 years ago by Paul ▴ 760

score 19 · Answer 1 · 2011-01-13

19

Entering edit mode

13.3 years ago

Casey Bergman 18k

As an alternative web-based solution the following will give you 3' UTR information for each transcript of a gene, but will require a little subtraction, try the Ensembl Biomart, e.g.

Choose Database -> Ensembl Genes 60
Choose Dataset -> Homo Sapiens Genes
Click "Filters" on left hand menubar
Expand "Gene" section by clicking "+"
Select "ID list limit" check box.
Select Entrez gene IDs from ID list limit drop down menu
Paste in list of Entrez gene IDs
Click "Attributes" on left hand menubar
Click "Sequences" radio button
Expand "Sequences" section by clicking "+"
Check "3' UTR" under "sequences" header
Expand "Header information" section by clicking "+"
Check "3' UTR start" and "3' UTR end" and "Transcript name" under "Transcript information" header
Click "Results" button at top left.

This will give you a set of fasta files of 3' UTRs for all transcripts for your set of Entrez gene IDs, which contain the start and stop of each 3' UTR on genome coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.

ADD COMMENT • link 13.3 years ago by Casey Bergman 18k

2

Entering edit mode

Paul, you can actually automate the query which Casey described by first doing it manually, then you click on 'Perl' in the upper right-hand corner of the interface and you will get a Perl program that retrieves the same query-results programmatically. You can then tweak the Perl program to your needs.

ADD REPLY • link 13.3 years ago by Joachim ★ 2.9k

0

Entering edit mode

Or, to save the clicking, you could hack up a little Perl script using the EnsEMBL Perl API ;-)

ADD REPLY • link 13.0 years ago by Steve Moss 2.3k

0

Entering edit mode

Regarding to 3' UTR, why i cannot search 3' UTR sequence on the cDNA transcript sequence? 5' UTR is search no problem.

ADD REPLY • link 12.7 years ago by Jirapong ▴ 20

score 8 · Answer 2 · 2016-07-22

8

Entering edit mode

7.8 years ago

marcosmorgan ▴ 120

This can be done using the GenomicFeatures library from Bioconductor (and dplyr)

I will use the refSeq transcripts ("refGene") from mouse ("mm10")

library(GenomicFeatures)
library(dplyr)

refSeq             <- makeTxDbFromUCSC(genom="mm10",tablename="refGene")                     
threeUTRs          <- threeUTRsByTranscript(refseq, use.names=TRUE)
length_threeUTRs   <- width(ranges(threeUTRs))
the_lengths        <- as.data.frame(length_threeUTRs)
the_lengths        <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths        <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")

The dataframe "the_lengths" has what you need.

ADD COMMENT • link 7.8 years ago by marcosmorgan ▴ 120

0

Entering edit mode

This was great, thanks. I needed to plot a histogram of 3'UTR lengths for a whole genome and this was a neat and accurate solution. I tried lots of other ways and none of them were successful. As an additional note I discovered that "refGene" is no longer a supported table so used the supportedUCSCtables() function to determine that "knownGene" is the currently supported table.

> supportedUCSCtables(genome = "mm10", url="http://genome.ucsc.edu/cgi-bin/")
             tablename          track          subtrack
1            knownGene     UCSC Genes              <NA>

truncated

I also noted that the gene with longest 3'UTR is an outlier and may actually be erroneous as it is a noncoding gene but for some reason there is a short coding track in its UCSC annotation which results init being annotated as having a long 3'UTR. The upper limit for 3'UTR lengths seems to be about 16k.

> the_lengths<- the_lengths[order(the_lengths$`3' UTR Length`,decreasing=TRUE),]
> head(the_lengths)
# A tibble: 6 x 2
  `RefSeq Transcript` `3' UTR Length`
                <chr>           <int>
1          uc012fxx.1           82649
2          uc009css.1           15929
3          uc009cst.1           15929
4          uc008drr.3           15586
5          uc008drs.2           15586
6          uc008dfj.2           14319

ADD REPLY • link 6.4 years ago by vperreau • 0

0

Entering edit mode

Hello everyone,

Similar to the @marcosmorgan job, you can follow your analysis to end up with a table with more information:

# get 3' UTR lengths and transcripts

# load libraries
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(magrittr)
txdb <-  TxDb.Hsapiens.UCSC.hg19.knownGene

# extract all 3' UTR regions by transcript
threeUTR <- threeUTRsByTranscript(txdb, use.name=TRUE)

# extract the names of the transcripts
transcripts_three_UTR <- names(threeUTR)

# keys and column for the query
keys <- transcripts_three_UTR
columns <- c("GENEID", "TXEND", "TXSTART", "TXSTRAND", "TXCHROM")

# query
df_three_utr <- select(txdb, 
                       keys=transcripts_three_UTR, 
                       columns=c(columns), 
                       keytype="TXNAME")

# computing the width of the 3' UTR regions
threeUTR_df <- as.data.frame(threeUTR)
threeUTR_df <- threeUTR_df %>% 
  dplyr::group_by(group, group_name) %>% 
  dplyr::summarize(sum(width))

# adding the information
df_three_utr$width_3_UTR <- threeUTR_df$`sum(width)`

# ordering
df_three_utr_sorted <- df_three_utr[order(as.numeric(df_three_utr$width_3_UTR), 
                                         decreasing=TRUE), ]


> head(df_three_utr_sorted)
          TXNAME    GENEID TXCHROM TXSTRAND   TXSTART     TXEND width_3_UTR
17445 uc010jgw.3     79628    chr5        - 148361713 148411250       22561
17446 uc003lpt.3     79628    chr5        - 148361713 148442737       22561
17447 uc003lpu.3     79628    chr5        - 148361713 148442737       22561
17448 uc010jgx.3     79628    chr5        - 148361713 148442737       22561
22344 uc021zzk.1 100131257    chr7        -   7115401   7136417       20462
8531  uc002sfp.2     22848    chr2        -  69685127  69870977       17874

Of course, you can order the table by the field of your interest.

ADD REPLY • link 2.6 years ago by t.montserrat.ayuso • 0

0

Entering edit mode

Similar to @vperreau, you can do that from the command-line, too. For example with a script like this:

# Extract the features 3' UTR
bioawk -c gff '$feature ~ /3UTR/ {print $0}' hg19-knownGene.gtf > three_utr.gff

# Cut fields of interest
cut -f 1,4,5,7,9 three_utr.gff > three_utr_regions.txt

# Compute the width for each 3' UTR
awk '{print $3-$2+1}' three_utr_regions.txt > width.txt

# Add width column
paste three_utr_regions.txt width.txt > three_utr_width.txt

# Reorder columns
awk -F"\t" -v OFS="\t" '{print $1,$2,$3,$6,$4,$5}' three_utr_width.txt > three_utr_info.txt

# Sorting by width
sort -k4,4nr three_utr_info.txt > three_utr_info_sorted.txt

# Extract all gene ID
cut -f6 three_utr_info_sorted.txt | awk -F" " '{print $4}' | sed 's/"//g' | sed 's/;//g' > transcript_id.txt

# Replacing metadata for transcript ID
cut -f1,2,3,4,5 three_utr_info_sorted.txt > three_utr_ranges.txt
paste three_utr_ranges.txt transcript_id.txt > three_utr_width_genes.txt

# Keeping the transcript ID and the width and sorting
cut -f4,6 three_utr_width_genes.txt | sort -k2,2 > three_utr_width_sorted.txt

# Computing the total width of 3' UTR of all transcripts
awk -v OFS="\t" '{a[$2] += $1} END{for (i in a) print i,a[i]}' three_utr_width_sorted.txt | sort -k2,2nr > three_utr_widths.txt

The finished product is a tab delimited file with each 3' UTR and its length sorted by decreasing width:

$ head -n3 three_utr_widths.txt
uc003lpt.3      22561
uc003lpu.3      22561
uc010jgw.3      22561

ADD REPLY • link 2.6 years ago by t.montserrat.ayuso • 0

Ram · Answer 3 · 2011-01-13

The table KnownGene in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns, exons, cdsStart/end , txStart/end).

The table kgXref contains the NCBI id and is linked to KnownGene.

for the genes on the '+' strand the query would be (for rapidity, I won't take in account any splicing between the last codon and the end of the transcription, it would need more code than a simple SQL query ):

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18
mysql>select distinct X.geneSymbol, K.txEnd-K.cdsEnd
 from
 kgXref as X,
 knownGene as K
 where
   X.geneSymbol!=""
   and K.name=X.kgId and
   K.strand="+" ;

+---------------+------------------+
| geneSymbol    | K.txEnd-K.cdsEnd |
+---------------+------------------+
| BC032353      |             3006 |
| AX748260      |             3157 |
| BC048429      |             1540 |
| OR4F5         |                0 |
| OR4F5         |                1 |
| DQ599874      |               31 |
| DQ599768      |               78 |
(..)

score 4 · Answer 4 · 2011-01-13

4

Entering edit mode

13.3 years ago

Larry_Parnell 16k

Hmmm, there is not typically alternative splicing of 3'-UTRs, but it can happen. There certainly are lots of examples of alternate terminal exons. So, I would not want to link gene symbol to 3'-UTR length, but rather gene symbol to mRNA identifier to its 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test for one gene with 2 mRNA isoforms and hence two different, or not, 3'-UTR lengths.

Just something to consider...

ADD COMMENT • link 13.3 years ago by Larry_Parnell 16k

2

Entering edit mode

The gene I worked on in grad school had alternate 3'-UTRs. http://www.ncbi.nlm.nih.gov/pubmed/1487151 Differed by over 700 bp. That said, another caveat of pulling this from dbs is that I know for UCSC that they don't make the call on presence of a UTR, they rely on the genbank record. Absence of a UTR does not necessarily mean there isn't one.

ADD REPLY • link 13.3 years ago by Mary 11k