Question

Refseq fasta files from different species

1

Entering edit mode

7.9 years ago

llltrunkslll • 0

Hello All,

I'm new to the forum and I'm looking for a bit of help. I'd like to pull down the Refseq fasta protein sequences of a gene (all isoforms), however I'd like to do that with several species. For example, I'd like to get the gene ACTB for human (which has only 1 isoform NP_001092.1), but also get sequences for mouse (NP_031419.1), and macaca fascicularis ( NP_001271954.1). This is easy for an example since each only has one isoform, and the monkey assembly is well annotated for this gene so it doesn't have the XP_ type sequence it usually does.

This procedure is a bit tricky because sometimes the gene I'm pulling may not exist in one of the other species (i.e. not present in mouse), and importantly I need them all to be from the same reference source build (i.e. should all be RefSeq or all Ensembl). I have a very strong preference for RefSeq though.

Here's what I've tried using other methods in R, but no success:

library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(rentrez)
symbols <- c("^ACTB$")
grep(symbols, keys(x = org.Hs.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in human
grep(symbols, keys(x = org.Mm.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in mouse
keys.h <- keys(x = org.Hs.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Hs.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
keys.m <- keys(x = org.Mm.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Mm.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
#map HGNC to ID#
id.h <- mapIds(x = org.Hs.eg.db, keys = keys.h, keytype = 'SYMBOL', column = 'ENTREZID')
id.m <- mapIds(x = org.Mm.eg.db, keys = keys.m, keytype = 'SYMBOL', column = 'ENTREZID')
select(x = org.Hs.eg.db, keys = id.h, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
select(x = org.Mm.eg.db, keys = id.m, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
#pull Human protein data
id.link <- entrez_link(dbfrom = "gene", id = id.h, db = "protein")
id.prot <- id.link$links$gene_protein_refseq #this is where no protein is found
id.seqs <- entrez_fetch(db = "protein", id = id.prot, rettype = "fasta")

As you can see I'm able to retrieve human and mouse decently, but unfortunately I can't figure out a way to pull the monkey protein sequences. I can't find an equivalent "org.Mf.eg.db"database for the monkey, that would pull the NP_001271954.1 sequence. Can anyone help me figure this one out?

Thank you in advance!

sequence R NCBI RefSeq ortholog • 2.3k views

ADD COMMENT • link 7.8 years ago by llltrunkslll • 0

0

Entering edit mode

Hey All,

It's been about 3 weeks and 102 views, so it seems like an interesting topic to people. Could anyone please assist?

ADD REPLY • link 7.8 years ago by llltrunkslll • 0

score 1 · Accepted Answer · 2016-07-26

well for any poor sod that comes across this post and wanted to know, I solved my own problem by utilizing the BSGenomes package and GenomicFeatures package:

> library(GenomicFeatures) cyno.5.0 <- makeTxDbFromUCSC(genome =
> "macFas5", tablename = "refGene") saveDb(x = cyno.5.0, file =
> "TxDb_cyno_UCSC_5_refGene")

This will build a TxDb object that can be used to get the sequences for translation. A bit convoluted but works for the less used organisms. This could be amended for another species by simply scanning other available builds (uses rtracklayer package loaded as dependency):

ucscGenomes() #scroll through to find relevant organisms