Question: 3'UTR sequence for whole transcriptomes from biomart
4.6 years ago, dolevrahat (Israel) wrote:

I am looking for a way to get the sequences of 3' UTRs for entire transcriptomes, for a few dozen species.

I have tried doing this via BioMart, but I was unable to find this data on the FTP site, and downloading the data manually for so many species is infeasible.

I have also tried to get the data via the R biomaRt package, using the following code:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "trubripes_gene_ensembl")  # Fugu gene dataset
genes <- getBM(mart = ensembl, attributes = "ensembl_gene_id")     # all Ensembl gene IDs
s <- getSequence(seqType = "3utr", mart = ensembl, type = "ensembl_gene_id", id = genes[, 1])

But the output indicated "Sequence unavailable" for about 95% of the genes. On the other hand, when I tried the same with a subset of 100 mouse genes, I received more than 100 matches, with multiple non-duplicate sequences matching a single Ensembl gene.

What would be the right approach to accomplish this task?

Thanks in advance

Dolev Rahat

 

ensembl biomart R • 2.3k views
modified 4.6 years ago by Emily_Ensembl • written 4.6 years ago by dolevrahat

You're already using the most obvious method. For the species with many instances of "Sequence unavailable", have you looked at their annotations to see if they have much in the way of annotated UTRs?

written 4.6 years ago by Devon Ryan
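One quick way to quantify this for a given species is to tally the "Sequence unavailable" entries in the getSequence() result. The following is a minimal sketch, not part of the original thread: utr_availability is a hypothetical helper name, and it assumes the result is a data frame with a column named "3utr" (as returned for seqType = "3utr"). The toy data frame stands in for a real result and contains made-up sequences.

```r
# Summarise how many getSequence() results lack a 3' UTR sequence.
# Assumption: `seqs` is a data frame with a sequence column named "3utr".
utr_availability <- function(seqs, seq_col = "3utr") {
  sequences <- as.character(seqs[[seq_col]])
  # Treat both NA and the literal placeholder string as missing
  missing <- is.na(sequences) | sequences == "Sequence unavailable"
  c(total = length(sequences),
    unavailable = sum(missing),
    fraction_unavailable = if (length(sequences) > 0) mean(missing) else NA_real_)
}

# Toy input standing in for a real getSequence() result (made-up values)
toy <- data.frame(
  "3utr" = c("ACGTTTGA", "Sequence unavailable", "Sequence unavailable", "TTGACCA"),
  ensembl_gene_id = c("geneA", "geneB", "geneC", "geneD"),
  check.names = FALSE, stringsAsFactors = FALSE
)
utr_availability(toy)
```

With a real result, calling utr_availability(s) on the object from the question would give the same three numbers for Fugu.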
4.6 years ago, Emily_Ensembl (EMBL-EBI) wrote:

You can read more about Fugu gene annotation here.

Some species (eg human, mouse and zebrafish) have vast amounts of species-specific mRNA sequences that can be plotted onto their genomes, producing gene models that have 3' UTRs. Other species (eg Fugu) have limited amounts of species-specific data, mostly protein, and most of their annotation is done by plotting protein sequences from other species onto their genomes (we use proteins as we find that we get better hits with proteins than mRNAs, due to synonymous changes). If we're using proteins to produce gene models, we don't get UTRs.

written 4.6 years ago by Emily_Ensembl

Thanks. That's good to know. Is there any way that I can infer from the annotations how many UTRs are annotated for a given genome? That way I will be able to choose the genomes for my project in a more educated manner.

 

written 4.6 years ago by dolevrahat

No straightforward way. There will be more UTRs in species that are better annotated generally. I would recommend looking at human, the main model organisms (mouse, zebrafish, rat), agricultural species (pig is best) and anything that has an RNASeq genebuild. RNASeq genebuilds use an RNA transcriptome from that species, so they should contain UTRs. Have a look at the species homepages (eg sheep) to see if they have RNASeq in their genebuilds.

written 4.6 years ago by Emily_Ensembl

Thanks. For the benefit of future readers of this post, I'll mention that I was eventually able to extract the number of annotated UTRs relative to the number of annotated genes/transcripts for a genome using the following R code (warning: running this code for a large number of genomes may take many hours):

# Requires library(biomaRt)
countUTRs<-function(martString,ds){
  mart<-useMart(martString,dataset=ds)
  # getBM() returns unique rows by default, so each nrow() counts distinct coordinates
  geneStart<-getBM(attributes="start_position",mart=mart)
  transStart<-getBM(attributes="transcript_start",mart=mart)
  UTRstart<-getBM(attributes="3_utr_start",mart=mart)
  nUTR<-nrow(UTRstart)
  nGene<-nrow(geneStart)
  nTrans<-nrow(transStart)
  # Note: c() coerces the counts and ratios to character strings
  countTable<-c(martString,ds,nUTR,nGene,nUTR/nGene,nTrans,nUTR/nTrans)
  countTable
}

countTable<-countUTRs(YOUR_MART_OF_INTEREST, YOUR_GENOME_OF_INTEREST)

for example: countUTRs("plants_mart_24","opunctata_eg_gene")

written 4.6 years ago by dolevrahat
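A possible per-transcript variant of the same idea is sketched below. It is not from the original thread. It assumes that "ensembl_transcript_id" and "3_utr_start" can be requested together in a single getBM() call, and that rows without an annotated 3' UTR come back with NA in "3_utr_start". The function names summarise_utr_coverage and utr_coverage_for_dataset are hypothetical. The toy data frame uses made-up IDs and coordinates to show the expected shape.

```r
# Count transcripts that have at least one annotated 3' UTR coordinate.
# Assumption: `bm` looks like the result of
#   getBM(attributes = c("ensembl_transcript_id", "3_utr_start"), mart = mart)
# with NA in "3_utr_start" where no 3' UTR is annotated.
summarise_utr_coverage <- function(bm) {
  # For each transcript, TRUE if any of its rows has a 3' UTR start
  has_utr <- tapply(!is.na(bm[["3_utr_start"]]), bm[["ensembl_transcript_id"]], any)
  data.frame(
    n_transcripts = length(has_utr),
    n_with_utr = sum(has_utr),
    fraction_with_utr = mean(has_utr)
  )
}

# Hypothetical wrapper: needs the biomaRt package and network access,
# so it is only defined here and not called.
utr_coverage_for_dataset <- function(mart_name, dataset) {
  mart <- biomaRt::useMart(mart_name, dataset = dataset)
  bm <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "3_utr_start"),
                       mart = mart)
  cbind(dataset = dataset, summarise_utr_coverage(bm))
}

# Toy input (made-up IDs and coordinates)
toy_bm <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T2", "T3", "T3"),
  "3_utr_start" = c(NA, 1500, NA, NA, 920),
  check.names = FALSE
)
summarise_utr_coverage(toy_bm)
```

To compare several genomes, the wrapper could be applied over a vector of dataset names, for example do.call(rbind, lapply(c("mmusculus_gene_ensembl", "trubripes_gene_ensembl"), function(ds) utr_coverage_for_dataset("ensembl", ds))). Compared with three separate queries, this uses one query per dataset and counts transcripts directly rather than distinct coordinates.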