How to link UCSC transcript IDs to protein IDs using Bioconductor?
8.0 years ago
Aurelie MLB ▴ 360

Hello,

I am new to this area so apologies if the answer is obvious.

I am currently using Bioconductor packages to access the UCSC genome and get the transcripts for my genes of interest. I would also like to link those transcripts to a protein if they are actually translated, but I could not find an easy way to do this with Bioconductor. I presume I could get the CDS and translate them, but I would like more than this, for instance access to their Ensembl IDs.

Would someone know how to do this please?

Many thanks

genome sequence gene Forum R • 6.0k views

Hi,

I have a related question. I noticed that through biomaRt I can only access the Homo sapiens Ensembl dataset "Homo sapiens genes (GRCh38.p2)". I also want to translate Ensembl transcript IDs into RefSeq IDs, but my Ensembl transcript IDs are from the GRCh37/hg19 build. Do you have any advice on how to retrieve these IDs through biomaRt, as in the example above, or on a better way to do it?

8.0 years ago
Martin Morgan ★ 1.6k

Your question doesn't really provide enough information, but maybe you're interested in the knownGene track in a model organism, for which there is already a Bioconductor package:

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)


From here you can discover available 'keytypes' and 'columns'

> keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> columns(TxDb.Hsapiens.UCSC.hg19.knownGene)


extract all the transcript ids

> txid = keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")


and get their corresponding Entrez gene ids

> df = select(TxDb.Hsapiens.UCSC.hg19.knownGene, txid, "GENEID", "TXID")


> head(df)
  GENEID  TXID
1      1 70455
2      1 70456
3     10 31944
4    100 72132
5   1000 65378
6   1000 65379
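
The TXID values above are identifiers internal to the TxDb object; the UCSC transcript names themselves should be in the TXNAME column. A minimal sketch, assuming the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed (the variable names txdb and tx_names are only for illustration):

```r
# Sketch: show UCSC transcript names (TXNAME) next to the internal TXID.
# Assumes TxDb.Hsapiens.UCSC.hg19.knownGene is installed; skipped otherwise.
if (requireNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene", quietly = TRUE)) {
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

  # Take only a few internal ids to keep the output short
  txid <- head(keys(txdb, "TXID"))

  # One row per transcript: internal id, UCSC name, Entrez gene id (NA if none)
  tx_names <- select(txdb, keys = txid,
                     columns = c("TXNAME", "GENEID"), keytype = "TXID")
  print(tx_names)
}
```

If you start from UCSC names rather than internal ids, the same call with keytype = "TXNAME" should work in the other direction.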


If you wanted more information about the genes, you might use library(org.Hs.eg.db) and then

> head(select(org.Hs.eg.db, df$GENEID, c("SYMBOL", "GENENAME")))
  ENTREZID SYMBOL                                              GENENAME
1        1   A1BG                                alpha-1-B glycoprotein
2        1   A1BG                                alpha-1-B glycoprotein
3       10   NAT2 N-acetyltransferase 2 (arylamine N-acetyltransferase)
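
The same org.Hs.eg.db package also carries Ensembl identifiers, so a gene-level link from Entrez gene ids to Ensembl protein ids is possible without going online. A minimal sketch, assuming org.Hs.eg.db is installed and using EGFR (Entrez gene id 1956) as the example; note this lists all proteins annotated to the gene and does not pair each transcript with its own protein:

```r
# Sketch: gene-level mapping from an Entrez gene id to Ensembl protein ids.
# Assumes org.Hs.eg.db is installed; skipped otherwise.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)

  # "1956" is the Entrez gene id of EGFR
  egfr_prot <- select(org.Hs.eg.db, keys = "1956",
                      columns = c("SYMBOL", "ENSEMBLPROT"),
                      keytype = "ENTREZID")
  print(head(egfr_prot))
}
```

For a transcript-by-transcript pairing, a biomaRt query on ensembl_transcript_id is probably the safer route.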


Also, biomart is accessible through library(biomaRt). The package has a good vignette, available from the package landing page. See the introduction to Bioconductor annotation workflows for some additional information. If you're more specific about what your needs are, then other approaches might be possible.

For more general annotations, the biomaRt package is very handy. The idea is to discover the 'mart', 'dataset', 'filters' and 'attributes' available, via listMarts() etc., and then to compose a query

> library(biomaRt)
## listMarts(), listDatasets(useMart("ensembl")), etc
> mart <- useMart("ensembl", "hsapiens_gene_ensembl")
> filters <- "ensembl_transcript_id"      # info I'll provide, see listFilters(mart)
> attr <-                                 # info I want, see listAttributes(mart)
+     c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id")
> values = c("ENST00000275493", "ENST00000344576") # info I have


and then the query

> getBM(attr, filters, values, mart)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493


An alternative to the final line, consistent with the use of select in other annotation resources, is

> select(mart, values, attr, filters)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493


In truth I 'discovered' the relevant marts, datasets, etc., partly in R and partly by navigating the Ensembl mart in a browser. Don't forget to check out the biomaRt vignette.
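
One thing to watch: the TxDb package above is built on hg19, whereas the default Ensembl mart follows the current assembly. If your transcript ids come from GRCh37/hg19, the GRCh argument of useEnsembl() can point the query at the GRCh37 archive instead. A minimal sketch, assuming biomaRt is installed and the archive is reachable; map_tx_grch37 is a hypothetical helper name, and the refseq_mrna attribute is worth confirming with listAttributes() on the mart you get back:

```r
# Hypothetical helper (not part of biomaRt): map Ensembl transcript ids to
# Ensembl protein ids and RefSeq mRNA ids using the GRCh37 (hg19) archive.
# Calling it needs the biomaRt package and network access.
map_tx_grch37 <- function(tx_ids) {
  # GRCh = 37 asks useEnsembl() for the GRCh37 archive mart
  mart37 <- biomaRt::useEnsembl(biomart = "ensembl",
                                dataset = "hsapiens_gene_ensembl",
                                GRCh = 37)

  # Same query shape as above, with one extra attribute
  biomaRt::getBM(attributes = c("ensembl_transcript_id",
                                "ensembl_peptide_id",
                                "refseq_mrna"),
                 filters = "ensembl_transcript_id",
                 values = tx_ids,
                 mart = mart37)
}
```

Usage would be along the lines of map_tx_grch37(c("ENST00000275493", "ENST00000344576")); transcripts without a protein product should come back with an empty ensembl_peptide_id.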


Hi Martin, thank you so much for your answer. I was trying to link the transcripts to a protein product. For instance, on the Ensembl interface for a given gene (e.g. EGFR), I noticed that you can see several transcripts (e.g. ENST00000275493 or ENST00000344576), and each transcript has an associated protein (e.g. ENSP00000275493 or ENSP00000345973). I was trying to get this kind of information using Bioconductor, starting from UCSC transcript IDs.


I've updated my answer with how one can use biomaRt (the Bioconductor package) to query biomart (the online resource).


Thanks a lot! This is really helpful!

8.0 years ago
Kizuna ▴ 850

If you do not have many transcript IDs, you can use BioMart (Ensembl): http://www.ensembl.org/biomart/martview/b3b87cd3b220cf9d6d08d7de1a51fadd

You can also find easy tutorials for this tool :)

Hope it helps.


Hi Kizuna, thanks a lot! The thing is, I do have quite a few, so I would like to automate this. That is why I was interested in Bioconductor.