As an alternative, web-based solution, try the Ensembl BioMart. The following will give you 3' UTR information for each transcript of a gene, but will require a little subtraction. For example:
- Choose Database -> Ensembl Genes 60
- Choose Dataset -> Homo sapiens genes
- Click "Filters" on left hand menubar
- Expand "Gene" section by clicking "+"
- Select "ID list limit" check box.
- Select Entrez gene IDs from ID list limit drop down menu
- Paste in list of Entrez gene IDs
- Click "Attributes" on left hand menubar
- Click "Sequences" radio button
- Expand "Sequences" section by clicking "+"
- Check "3' UTR" under "sequences" header
- Expand "Header information" section by clicking "+"
- Check "3' UTR start" and "3' UTR end" and "Transcript name" under "Transcript information" header
- Click "Results" button at top left.
This will give you a FASTA file of 3' UTR sequences for all transcripts of your set of Entrez gene IDs, with headers that contain the start and end of each 3' UTR in genome coordinates. I believe this solution has the same problem of not accounting for introns in 3' UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3' UTRs.
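The "little subtraction" can be scripted. Below is a minimal Python sketch, assuming the exported FASTA headers have the hypothetical layout `>transcript_name|utr_start|utr_end`, with multi-segment UTRs listed as `;`-separated values. The real field order depends on which attributes you ticked, so adjust the `split` accordingly. The function name `utr_lengths_from_headers` is illustrative only. If the header does list each segment, summing per-segment lengths rather than subtracting the outermost coordinates avoids counting introns.

```python
def utr_lengths_from_headers(fasta_text):
    """Map transcript name -> (exonic_length, genomic_span) from FASTA headers.

    Assumes headers look like '>name|start|end' (hypothetical layout), where
    start and end may be ';'-separated lists, one value per UTR segment.
    """
    lengths = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are not needed for the subtraction
        name, start_field, end_field = line[1:].strip().split("|")
        if not start_field or not end_field:
            continue  # transcript without an annotated 3' UTR
        starts = [int(x) for x in start_field.split(";")]
        ends = [int(x) for x in end_field.split(";")]
        # Coordinates are treated as 1-based inclusive, hence the +1.
        exonic = sum(e - s + 1 for s, e in zip(starts, ends))
        span = max(ends) - min(starts) + 1
        lengths[name] = (exonic, span)
    return lengths


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    example = ">TX-201|100;300|149;349\nACGT\n>TX-202|500|519\nACGT\n"
    for name, (exonic, span) in utr_lengths_from_headers(example).items():
        print(name, exonic, span)
```

When the two numbers differ for a transcript, the 3' UTR contains at least one intron, and the exonic length is the one to use.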
This can be done using the GenomicFeatures package from Bioconductor (and dplyr).
I will use the RefSeq transcripts ("refGene") from mouse ("mm10"):
    library(GenomicFeatures)
    library(dplyr)
    # Build a transcript database from the UCSC refGene track for mm10
    refSeq <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")
    # One list element per transcript, one range per 3' UTR exon segment
    threeUTRs <- threeUTRsByTranscript(refSeq, use.names=TRUE)
    length_threeUTRs <- width(ranges(threeUTRs))
    the_lengths <- as.data.frame(length_threeUTRs)
    # Sum the segment widths so that introns inside the 3' UTR are not counted
    the_lengths <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
    the_lengths <- unique(the_lengths[,c("group_name", "sum(value)")])
    colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")
The dataframe "the_lengths" has what you need.
The knownGene table in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns and exons, cdsStart/cdsEnd, txStart/txEnd).
kgXref contains the NCBI id and is linked to
For the genes on the '+' strand the query would be as follows (for speed, I won't take into account any splicing between the last codon and the end of transcription; that would need more code than a simple SQL query):
    mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18
    mysql> select distinct X.geneSymbol, K.txEnd-K.cdsEnd from kgXref as X, knownGene as K where X.geneSymbol!="" and K.name=X.kgId and K.strand="+";
    +---------------+------------------+
    | geneSymbol    | K.txEnd-K.cdsEnd |
    +---------------+------------------+
    | BC032353      |             3006 |
    | AX748260      |             3157 |
    | BC048429      |             1540 |
    | OR4F5         |                0 |
    | OR4F5         |                1 |
    | DQ599874      |               31 |
    | DQ599768      |               78 |
    (..)
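The simplifications in the query above (ignoring splicing and the '-' strand) can be lifted with a little post-processing of each knownGene row. The following is a minimal Python sketch. It assumes you have fetched strand, txStart, txEnd, cdsStart, cdsEnd, exonStarts and exonEnds for a transcript, that coordinates are 0-based half-open as in UCSC tables, and that exonStarts/exonEnds arrive as comma-separated text. `three_prime_utr_length` is an illustrative name, not part of any library.

```python
def three_prime_utr_length(strand, tx_start, tx_end, cds_start, cds_end,
                           exon_starts, exon_ends):
    """Spliced 3' UTR length for one knownGene-style row.

    exon_starts / exon_ends are comma-separated strings (a trailing comma
    is tolerated). Coordinates are assumed 0-based, half-open.
    """
    if cds_start == cds_end:
        # Assumed convention: an empty CDS marks a non-coding transcript.
        return 0
    # The 3' UTR lies downstream of the CDS in transcript orientation.
    if strand == "+":
        lo, hi = cds_end, tx_end
    else:
        lo, hi = tx_start, cds_start
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    # Sum only the exonic bases that fall inside [lo, hi).
    return sum(max(0, min(e, hi) - max(s, lo)) for s, e in zip(starts, ends))


if __name__ == "__main__":
    # Made-up transcript with three exons and an intron inside the 3' UTR.
    row = (1000, 2000, 1100, 1500, "1000,1400,1800,", "1200,1600,2000,")
    print(three_prime_utr_length("+", *row))  # spliced length on '+'
    print(row[1] - row[3])                    # naive txEnd - cdsEnd
```

In the made-up example the naive difference also counts the intron between the last two exons, which is why it comes out larger than the spliced length.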
Hmmm, alternative splicing of 3'-UTRs is not typical, but it can happen, and there certainly are lots of examples of alternative terminal exons. So I would not want to link gene symbol directly to 3'-UTR length, but rather gene symbol to mRNA identifier to its 3'-UTR length. Perhaps Pierre's table above shows that for gene OR4F5, but a length of zero is not a good test case for one gene with two mRNA isoforms and hence two possibly different 3'-UTR lengths.
Just something to consider...