Hello All,
I'm new to the forum and I'm looking for a bit of help. I'd like to pull down the Refseq fasta protein sequences of a gene (all isoforms), however I'd like to do that with several species. For example, I'd like to get the gene ACTB for human (which has only 1 isoform NP_001092.1), but also get sequences for mouse (NP_031419.1), and macaca fascicularis ( NP_001271954.1). This is easy for an example since each only has one isoform, and the monkey assembly is well annotated for this gene so it doesn't have the XP_ type sequence it usually does.
This procedure is a bit tricky because sometimes the gene I'm pulling may not exist in one of the other species (i.e. not present in mouse), and importantly I need them all to be from the same reference source build (i.e. should all be RefSeq or all Ensembl). I have a very strong preference for RefSeq though.
Here's what I've tried using other methods in R, but no success:
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(rentrez)
symbols <- c("^ACTB$")
grep(symbols, keys(x = org.Hs.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in human
grep(symbols, keys(x = org.Mm.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in mouse
keys.h <- keys(x = org.Hs.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Hs.eg.db,
keytype = "SYMBOL"), ignore.case = T, perl = T)]
keys.m <- keys(x = org.Mm.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Mm.eg.db,
keytype = "SYMBOL"), ignore.case = T, perl = T)]
#map HGNC to ID#
id.h <- mapIds(x = org.Hs.eg.db, keys = keys.h, keytype = 'SYMBOL', column = 'ENTREZID')
id.m <- mapIds(x = org.Mm.eg.db, keys = keys.m, keytype = 'SYMBOL', column = 'ENTREZID')
select(x = org.Hs.eg.db, keys = id.h, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
select(x = org.Mm.eg.db, keys = id.m, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
#pull Human protein data
id.link <- entrez_link(dbfrom = "gene", id = id.h, db = "protein")
id.prot <- id.link$links$gene_protein_refseq #this is where no protein is found
id.seqs <- entrez_fetch(db = "protein", id = id.prot, rettype = "fasta")
As you can see I'm able to retrieve human and mouse decently, but unfortunately I can't figure out a way to pull the monkey protein sequences. I can't find an equivalent "org.Mf.eg.db"database for the monkey, that would pull the NP_001271954.1 sequence. Can anyone help me figure this one out?
Thank you in advance!
Hey All,
It's been about 3 weeks and 102 views, so it seems like an interesting topic to people. Could anyone please assist?