Question

3'UTR sequence for whole transcriptomes from biomart

9.3 years ago

dolevrahat ▴ 30

I am looking for a way to get the sequences of 3' UTRs for entire transcriptomes, for a few dozen species.

I tried doing this via BioMart, but could not find this data on the FTP site, and downloading the data manually for so many species is infeasible.

I have also tried to get the data via the R biomaRt package, using the following code:

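library(biomaRt)
# Connect to the Ensembl mart for Fugu (Takifugu rubripes)
ensembl <- useMart("ensembl", dataset = "trubripes_gene_ensembl")
# Fetch every Ensembl gene ID, then request the 3' UTR sequence for each
genes <- getBM(mart = ensembl, attributes = "ensembl_gene_id")
s <- getSequence(seqType = "3utr", mart = ensembl, type = "ensembl_gene_id", id = genes[, 1])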

But the output showed "Sequence unavailable" for about 95% of the genes. On the other hand, when I tried the same with a subset of 100 mouse genes, I received more than 100 matches, with multiple non-duplicated sequences matching a single Ensembl gene.

What would be the right approach to accomplish this task?

Thanks in advance

Dolev Rahat

biomart R ensembl • 4.2k views

Updated 2.1 years ago by Ram 43k • written 9.3 years ago by dolevrahat ▴ 30

You're already using the most obvious method. For the species with many instances of "Sequence unavailable", have you looked at their annotations to see whether they have much in the way of annotated UTRs?
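One way to put a number on this is to tabulate how many rows of a getSequence() result actually carry a sequence. The following is only a minimal sketch: summarise_utr_availability() is a hypothetical helper, and it assumes the result has a sequence column named "3utr" and an ID column named "ensembl_gene_id", with missing UTRs reported as the literal string "Sequence unavailable" as described in the question. The toy data frame uses made-up IDs and sequences in place of a real query result.

```r
# Summarise how many rows of a getSequence()-style result carry a real 3' UTR.
# Assumption: missing UTRs appear as the string "Sequence unavailable".
summarise_utr_availability <- function(seqs, seq_col = "3utr",
                                       id_col = "ensembl_gene_id") {
  # TRUE for rows that hold an actual sequence (NA counts as missing)
  has_utr <- !is.na(seqs[[seq_col]]) &
    seqs[[seq_col]] != "Sequence unavailable"
  data.frame(
    n_rows = nrow(seqs),
    n_with_utr = sum(has_utr),
    n_genes_with_utr = length(unique(seqs[[id_col]][has_utr])),
    frac_rows_with_utr = mean(has_utr)
  )
}

# Toy input standing in for a real getSequence() result (values are made up).
toy <- data.frame(
  `3utr` = c("ACGT", "Sequence unavailable", "TTGA", "Sequence unavailable"),
  ensembl_gene_id = c("G1", "G2", "G1", "G3"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
utr_summary <- summarise_utr_availability(toy)
print(utr_summary)
```

On a real result you would pass the data frame returned by getSequence() instead of toy. Counting distinct gene IDs separately from rows matters because a gene with several transcripts can contribute several UTR sequences.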

Updated 2.1 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2015-01-16

9.3 years ago

Emily 23k

You can read more about Fugu gene annotation here.

Some species (e.g. human, mouse and zebrafish) have vast amounts of species-specific mRNA sequences that can be plotted onto their genomes, producing gene models that have 3' UTRs. Other species (e.g. Fugu) have limited amounts of species-specific data, mostly protein, and most of their annotation is done by plotting protein sequences from other species onto their genomes (we use proteins because we get better hits with proteins than with mRNAs, due to synonymous changes). If we're using proteins to produce gene models, we don't get UTRs.


Thanks. That's good to know. Is there any way that I can infer from the annotations how many UTRs are annotated for a given genome? That way I will be able to choose the genomes for my project in a more educated manner.

Updated 2.1 years ago by Ram 43k • written 9.3 years ago by dolevrahat ▴ 30

No straightforward way. Species that are better annotated overall will have more UTRs. I would recommend looking at human, the main model organisms (mouse, zebrafish, rat), agricultural species (pig is best) and anything that has an RNASeq genebuild. RNASeq genebuilds use an RNA transcriptome from that species, so they should contain the UTRs. Have a look at the species homepages (e.g. sheep) to see whether they have RNASeq in their genebuilds.

Updated 2.1 years ago by Ram 43k • written 9.3 years ago by Emily 23k

Thanks. For the benefit of future readers of this post, I'll mention that I was eventually able to extract the number of annotated UTRs relative to the number of annotated genes/transcripts for a genome using the following R code (warning: running this code for a large number of genomes may take many hours):

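library(biomaRt)

# Count 3' UTR, gene and transcript start coordinates for one dataset.
# getBM() returns unique rows by default, so each count is the number of
# distinct coordinate values and only approximates the number of features.
countUTRs <- function(martString, ds) {
  mart <- useMart(martString, dataset = ds)
  geneStart <- getBM(attributes = "start_position", mart = mart)
  transStart <- getBM(attributes = "transcript_start", mart = mart)
  UTRstart <- getBM(attributes = "3_utr_start", mart = mart)
  nUTR <- nrow(UTRstart)
  nGene <- nrow(geneStart)
  nTrans <- nrow(transStart)
  # c() coerces everything to character because the first two entries are strings
  countTable <- c(martString, ds, nUTR, nGene, nUTR / nGene, nTrans, nUTR / nTrans)
  countTable
}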

countTable<-countUTRs(YOUR_MART_OF_INTEREST, YOUR_GENOME_OF_INTEREST)

For example: countUTRs("plants_mart_24","opunctata_eg_gene")
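Since a run over many genomes can take hours, it may help to collect the per-genome counts into one table and to keep going when a single query fails. Below is a rough sketch of that idea, not a tested pipeline: collect_utr_counts() is a hypothetical helper, and count_fun is assumed to be any function that takes a dataset name and returns a named numeric vector with elements nUTR, nGene and nTrans (for instance a thin adapter around a BioMart query). The stub_counter() used here returns made-up numbers so the example runs without a network connection.

```r
# Apply a per-dataset counting function to many datasets and collect the
# results in one data frame. A failure for one dataset is recorded as NA
# instead of aborting the whole run.
collect_utr_counts <- function(datasets, count_fun) {
  rows <- lapply(datasets, function(ds) {
    counts <- tryCatch(count_fun(ds), error = function(e) NULL)
    if (is.null(counts)) {
      counts <- c(nUTR = NA_real_, nGene = NA_real_, nTrans = NA_real_)
    }
    data.frame(
      dataset = ds,
      nUTR = counts[["nUTR"]],
      nGene = counts[["nGene"]],
      nTrans = counts[["nTrans"]],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  # Ratios of UTR counts to gene and transcript counts
  out$utr_per_gene <- out$nUTR / out$nGene
  out$utr_per_transcript <- out$nUTR / out$nTrans
  out
}

# Hypothetical stand-in for a real BioMart-backed counter (numbers are made up).
stub_counter <- function(ds) {
  if (ds == "broken_dataset") stop("simulated query failure")
  c(nUTR = 50, nGene = 100, nTrans = 200)
}

utr_table <- collect_utr_counts(c("dataset_a", "broken_dataset"), stub_counter)
print(utr_table)
```

Datasets whose query failed show up as NA rows, so they can be retried later without repeating the ones that succeeded.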

Updated 2.1 years ago by Ram 43k • written 9.3 years ago by dolevrahat ▴ 30