Question: How to link UCSC transcript IDs to protein IDs using Bioconductor?
Aurelie MLB (United Kingdom) wrote, 4.5 years ago:

Hello,

I am new to this area so apologies if the answer is obvious.

I am currently using Bioconductor packages to access the UCSC genome and get the transcripts for my genes of interest. I would also like to link those transcripts to a protein, if they are actually translated, but I could not find an easy way to do this with Bioconductor. I presume I could get the CDS and translate them, but I would like to go further than that and, for instance, access the corresponding Ensembl IDs.

Would someone know how to do this please?

Many thanks

 

 

Tags: gene, sequence, forum, R, genome
Martin Morgan (United States) wrote, 4.5 years ago:

Your question doesn't really provide enough information, but maybe you're interested in the knownGene track of a model organism, for which there is already a Bioconductor package

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

From here you can discover available 'keytypes' and 'columns'

keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

extract all the transcript ids

txid = keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")

and get their corresponding Entrez gene ids

df = select(TxDb.Hsapiens.UCSC.hg19.knownGene, txid, "GENEID", "TXID")

leading to

> head(df)
  GENEID  TXID
1      1 70455
2      1 70456
3     10 31944
4    100 72132
5   1000 65378
6   1000 65379

If you wanted more information about the genes, you might use library(org.Hs.eg.db) and then

> head(select(org.Hs.eg.db, df$GENEID, c("SYMBOL", "GENENAME")))
  ENTREZID SYMBOL                                              GENENAME
1        1   A1BG                                alpha-1-B glycoprotein
2        1   A1BG                                alpha-1-B glycoprotein
3       10   NAT2 N-acetyltransferase 2 (arylamine N-acetyltransferase)
4      100    ADA                                   adenosine deaminase
5     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
6     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
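
The two lookups above can be chained to get from UCSC transcript names to Ensembl protein IDs. What follows is only a rough sketch: the function name `ucsc_tx_to_ensembl_protein` is made up for illustration, and it assumes that UCSC transcript names are stored under the `TXNAME` keytype and that `org.Hs.eg.db` offers an `ENSEMBLPROT` column (check with `keytypes()` and `columns()` as shown earlier). Note that the join goes through the Entrez gene ID, so every transcript of a gene is paired with every protein of that gene; it is not a transcript-specific mapping.

```r
# Hypothetical helper: UCSC transcript names -> Entrez gene id -> Ensembl protein ids.
# Assumes both annotation packages are installed.
ucsc_tx_to_ensembl_protein <- function(tx_names) {
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  library(org.Hs.eg.db)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

  # Step 1: UCSC transcript name -> Entrez gene id
  tx2gene <- select(txdb, keys = tx_names,
                    columns = "GENEID", keytype = "TXNAME")

  # Step 2: Entrez gene id -> gene symbol and Ensembl protein ids (gene level)
  gene_ids <- unique(na.omit(tx2gene$GENEID))
  gene2prot <- select(org.Hs.eg.db, keys = gene_ids,
                      columns = c("SYMBOL", "ENSEMBLPROT"),
                      keytype = "ENTREZID")

  # Step 3: join the two tables; transcripts without a gene id are kept with NA
  merge(tx2gene, gene2prot,
        by.x = "GENEID", by.y = "ENTREZID", all.x = TRUE)
}

# Example call (needs the packages above):
# tx <- head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, keytype = "TXNAME"))
# ucsc_tx_to_ensembl_protein(tx)
```

Transcripts of non-coding genes should simply come back with `NA` in the `ENSEMBLPROT` column.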

 

Also, biomart is accessible through library(biomaRt). The package has a good vignette, available from the package landing page. See the introduction to Bioconductor annotation workflows for some additional information. If you're more specific about what your needs are, then other approaches might be possible.

For more general annotations, the biomaRt package is very handy. The idea is to discover the 'mart', 'dataset', 'filters' and 'attributes' available, via listMarts() etc., and then to compose a query

library(biomaRt)
## listMarts(), listDatasets("ensembl"), etc
mart <- useMart("ensembl", "hsapiens_gene_ensembl")
filters <- "ensembl_transcript_id"      # info I'll provide, see listFilters(mart)
attr <-                                 # info I want, ?listAttributes
    c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id") 
values = c("ENST00000275493", "ENST00000344576") # info I have

and then the query

> getBM(attr, filters, values, mart)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493

An alternative to the final line, consistent with the use of select in other annotation resources, is

> select(mart, values, attr, filters)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493
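
The query above starts from Ensembl transcript IDs. To start from UCSC IDs instead, something along these lines might work, but treat it as an untested sketch: the function name is hypothetical, and the filter and attribute name `"ucsc"` is an assumption that may not exist in every Ensembl release, so the code checks `listFilters(mart)` before querying. UCSC IDs taken from hg19 may also carry version suffixes that differ from the ones the current Ensembl release knows about.

```r
# Hypothetical helper: query Ensembl BioMart with UCSC transcript ids.
# 'filter_name = "ucsc"' is an assumption; confirm it with listFilters(mart).
ucsc_to_ensembl_biomart <- function(ucsc_ids, filter_name = "ucsc") {
  library(biomaRt)
  mart <- useMart("ensembl", "hsapiens_gene_ensembl")

  # Stop early if this Ensembl release does not offer the assumed filter
  available <- listFilters(mart)$name
  if (!filter_name %in% available) {
    stop("Filter '", filter_name, "' not found; see listFilters(mart)")
  }

  # Assumes an attribute with the same name as the filter exists
  getBM(attributes = c(filter_name,
                       "ensembl_transcript_id",
                       "ensembl_peptide_id"),
        filters = filter_name,
        values = ucsc_ids,
        mart = mart)
}
```

Rows with an empty `ensembl_peptide_id` would correspond to transcripts with no annotated protein product.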

 

In truth I 'discovered' the relevant marts, data sets, etc., partly in R and partly by navigating the ensembl mart. Don't forget to check out the biomaRt vignette.


Hi Martin, thank you so much for your answer. I was trying to link the transcripts to a protein product. For instance, on the Ensembl interface for a given gene (e.g. EGFR), you can see several transcripts (e.g. ENST00000275493 or ENST00000344576), and each transcript has an associated protein (e.g. ENSP00000275493 or ENSP00000345973). I was trying to get this kind of information using Bioconductor, starting from UCSC transcript IDs.

(reply written 4.5 years ago by Aurelie MLB)

I've updated my answer with how one can use biomaRt (the Bioconductor package) to query biomart (the online resource).

(reply written 4.5 years ago by Martin Morgan)

Thanks a lot! This is really helpful!

(reply written 4.5 years ago by Aurelie MLB)
Kizuna (France, Paris) wrote, 4.5 years ago:

If you do not have many transcript IDs, you can use BioMart (Ensembl): http://www.ensembl.org/biomart/martview/b3b87cd3b220cf9d6d08d7de1a51fadd

You can also find easy tutorials for this tool :)

Hope it helps


Hi Kizuna, thanks a lot! The thing is, I do have quite a few, so I would like to automate this. This is why I was interested in Bioconductor.

(reply written 4.5 years ago by Aurelie MLB)
ola.o4 wrote, 3.6 years ago:

Hi, 

I have a related question. I noticed that through biomaRt I can only access the Homo sapiens Ensembl dataset "Homo sapiens genes (GRCh38.p2)". I also want to translate Ensembl transcript IDs into RefSeq IDs, but my Ensembl transcript IDs are from the GRCh37/hg19 build. Do you have any advice on how to retrieve these IDs through biomaRt, as in the example above? Or maybe advice on a better way to do it?

I'd appreciate your advice lots!
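
A possible starting point for this GRCh37 question, offered as an unverified sketch rather than a confirmed answer: biomaRt's `useEnsembl()` accepts a `GRCh = 37` argument that should connect to the archived GRCh37 mart instead of the current GRCh38 one. The function name below is hypothetical, and the attribute names `refseq_mrna` and `refseq_peptide` are assumptions to confirm with `listAttributes()` on the returned mart.

```r
# Hypothetical helper: Ensembl transcript ids (GRCh37) -> RefSeq ids.
# Attribute names are assumptions; verify them with listAttributes(mart37).
refseq_for_grch37_transcripts <- function(ensembl_tx_ids) {
  library(biomaRt)

  # GRCh = 37 asks biomaRt for the GRCh37 archive of the Ensembl gene mart
  mart37 <- useEnsembl(biomart = "ensembl",
                       dataset = "hsapiens_gene_ensembl",
                       GRCh = 37)

  getBM(attributes = c("ensembl_transcript_id",
                       "refseq_mrna",
                       "refseq_peptide"),
        filters = "ensembl_transcript_id",
        values = ensembl_tx_ids,
        mart = mart37)
}

# Example call with the transcript ids used earlier in this thread:
# refseq_for_grch37_transcripts(c("ENST00000275493", "ENST00000344576"))
```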
