As an alternative web-based solution, try the Ensembl BioMart. The following will give you 3' UTR information for each transcript of a gene, but will require a little subtraction to get the lengths, e.g.:
- Choose Database -> Ensembl Genes 60
- Choose Dataset -> Homo Sapiens Genes
- Click "Filters" on left hand menubar
- Expand "Gene" section by clicking "+"
- Select "ID list limit" check box.
- Select Entrez gene IDs from ID list limit drop down menu
- Paste in list of Entrez gene IDs
- Click "Attributes" on left hand menubar
- Click "Sequences" radio button
- Expand "Sequences" section by clicking "+"
- Check "3' UTR" under "sequences" header
- Expand "Header information" section by clicking "+"
- Check "3' UTR start" and "3' UTR end" and "Transcript name" under "Transcript information" header
- Click "Results" button at top left.
This will give you a FASTA file of 3' UTR sequences for all transcripts of your set of Entrez gene IDs, with headers that contain the start and end of each 3' UTR in genome coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.
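The "little subtraction" can be scripted once the header fields are parsed. Below is a minimal sketch, not a tested BioMart parser. It assumes that a 3' UTR interrupted by an intron appears as several start/end values joined by ";" (check this against your own export), and the function name `utr_length` is made up for illustration. Ensembl coordinates are 1-based and inclusive, so each segment contributes end - start + 1.

```python
def utr_length(starts, ends):
    """Sum 3' UTR segment lengths from 1-based inclusive coordinates.

    `starts` and `ends` are strings such as "1000;2000" (assumed format:
    one value per UTR segment, separated by ';').
    """
    start_vals = [int(s) for s in starts.split(";") if s]
    end_vals = [int(e) for e in ends.split(";") if e]
    if len(start_vals) != len(end_vals):
        raise ValueError("unequal number of start and end values")
    # Inclusive coordinates: a segment from 1000 to 1099 is 100 bases long.
    return sum(e - s + 1 for s, e in zip(start_vals, end_vals))


# Made-up coordinates, for illustration only.
print(utr_length("1000", "1099"))            # single segment
print(utr_length("1000;2000", "1099;2049"))  # UTR split by an intron
```

Summing per segment rather than taking last end minus first start is what keeps intronic bases out of the length.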
The knownGene table in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns and exons, cdsStart/cdsEnd, txStart/txEnd).
kgXref contains the NCBI id and is linked to
For the genes on the '+' strand the query would be as follows (for speed, I won't take into account any splicing between the last codon and the end of transcription; that would need more code than a simple SQL query):
mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18

mysql> select distinct X.geneSymbol, K.txEnd-K.cdsEnd
    -> from kgXref as X, knownGene as K
    -> where X.geneSymbol!="" and K.name=X.kgId and K.strand="+";
+---------------+------------------+
| geneSymbol    | K.txEnd-K.cdsEnd |
+---------------+------------------+
| BC032353      |             3006 |
| AX748260      |             3157 |
| BC048429      |             1540 |
| OR4F5         |                0 |
| OR4F5         |                1 |
| DQ599874      |               31 |
| DQ599768      |               78 |
(..)
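The "more code" needed to handle splicing and the '-' strand could look something like the sketch below. It assumes you have fetched strand, txStart, txEnd, cdsStart, cdsEnd, exonStarts and exonEnds for each knownGene row, and that the exon lists have been decoded to comma-separated strings. The function names are hypothetical. UCSC coordinates are 0-based and half-open, so a plain end - start gives a length. The 3' UTR is the exonic part between cdsEnd and txEnd on '+', and between txStart and cdsStart on '-'.

```python
def parse_coords(field):
    """Turn a UCSC comma-separated list such as '100,500,' into ints."""
    return [int(x) for x in field.split(",") if x]


def three_prime_utr_length(strand, tx_start, tx_end, cds_start, cds_end,
                           exon_starts, exon_ends):
    """Exonic length of the 3' UTR, or None when the row has no CDS."""
    if cds_start == cds_end:
        # Assumption: non-coding transcripts are stored with an empty CDS.
        return None
    if strand == "+":
        lo, hi = cds_end, tx_end
    else:
        lo, hi = tx_start, cds_start
    total = 0
    for s, e in zip(parse_coords(exon_starts), parse_coords(exon_ends)):
        # Add only the part of this exon that overlaps the UTR interval.
        total += max(0, min(e, hi) - max(s, lo))
    return total


# Made-up transcript with three exons; txEnd - cdsEnd alone would give 400.
print(three_prime_utr_length("+", 100, 1000, 150, 600,
                             "100,500,900,", "300,700,1000,"))
```

With these made-up numbers the exon-aware length is 200, because the 200-base intron between the last two exons is no longer counted.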
This can be done using the GenomicFeatures package from Bioconductor (and dplyr).
I will use the RefSeq transcripts ("refGene") from mouse ("mm10"):
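library(GenomicFeatures)
library(dplyr)

# Build a TxDb from the UCSC refGene track for mouse (mm10)
refSeq <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")

# 3' UTR exon ranges, grouped by transcript and named by RefSeq ID
threeUTRs <- threeUTRsByTranscript(refSeq, use.names=TRUE)

# One width per UTR exon; as.data.frame() gives columns group, group_name, value
length_threeUTRs <- width(ranges(threeUTRs))
the_lengths <- as.data.frame(length_threeUTRs)

# Sum the exon widths per transcript, so introns inside the UTR are not counted
the_lengths <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")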
The dataframe "the_lengths" has what you need.
Hmmm, 3'-UTRs are not typically alternatively spliced, but it can happen. There certainly are lots of examples of alternate terminal exons. So I would not want to link gene symbol to 3'-UTR length, but rather gene symbol to mRNA identifier to its 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test case for a gene with 2 mRNA isoforms and hence two 3'-UTR lengths, which may or may not differ.
Just something to consider...
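To make the gene symbol to mRNA identifier to 3'-UTR length idea concrete, here is a small sketch of that mapping. All symbols, identifiers and lengths are invented for illustration; in practice the rows would come from a query that also returns the transcript ID.

```python
from collections import defaultdict

# (gene symbol, mRNA identifier, 3' UTR length) rows; all values are made up.
rows = [
    ("GENE_A", "mRNA_1", 0),
    ("GENE_A", "mRNA_2", 1),
    ("GENE_B", "mRNA_3", 1540),
]

# Nested mapping: gene symbol -> mRNA identifier -> 3' UTR length.
utr_by_gene = defaultdict(dict)
for symbol, mrna_id, length in rows:
    utr_by_gene[symbol][mrna_id] = length

for symbol, isoforms in utr_by_gene.items():
    distinct = sorted(set(isoforms.values()))
    print(symbol, isoforms, "distinct lengths:", distinct)
```

Keeping the mRNA identifier as the middle key means two isoforms with the same length stay distinguishable, which a "select distinct" on symbol and length alone would collapse.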