Question: Refseq fasta files from different species
1
gravatar for llltrunkslll
3.3 years ago by
llltrunkslll0 wrote:

Hello All,

I'm new to the forum and I'm looking for a bit of help. I'd like to pull down the Refseq fasta protein sequences of a gene (all isoforms), however I'd like to do that with several species. For example, I'd like to get the gene ACTB for human (which has only 1 isoform NP_001092.1), but also get sequences for mouse (NP_031419.1), and macaca fascicularis ( NP_001271954.1). This is easy for an example since each only has one isoform, and the monkey assembly is well annotated for this gene so it doesn't have the XP_ type sequence it usually does.

This procedure is a bit tricky because sometimes the gene I'm pulling may not exist in one of the other species (i.e. not present in mouse), and importantly I need them all to be from the same reference source build (i.e. should all be RefSeq or all Ensembl). I have a very strong preference for RefSeq though.

Here's what I've tried using other methods in R, but no success:

library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(rentrez)
symbols <- c("^ACTB$")
grep(symbols, keys(x = org.Hs.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in human
grep(symbols, keys(x = org.Mm.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in mouse
keys.h <- keys(x = org.Hs.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Hs.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
keys.m <- keys(x = org.Mm.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Mm.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
#map HGNC to ID#
id.h <- mapIds(x = org.Hs.eg.db, keys = keys.h, keytype = 'SYMBOL', column = 'ENTREZID')
id.m <- mapIds(x = org.Mm.eg.db, keys = keys.m, keytype = 'SYMBOL', column = 'ENTREZID')
select(x = org.Hs.eg.db, keys = id.h, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
select(x = org.Mm.eg.db, keys = id.m, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
#pull Human protein data
id.link <- entrez_link(dbfrom = "gene", id = id.h, db = "protein")
id.prot <- id.link$links$gene_protein_refseq #this is where no protein is found
id.seqs <- entrez_fetch(db = "protein", id = id.prot, rettype = "fasta")

As you can see I'm able to retrieve human and mouse decently, but unfortunately I can't figure out a way to pull the monkey protein sequences. I can't find an equivalent "org.Mf.eg.db"database for the monkey, that would pull the NP_001271954.1 sequence. Can anyone help me figure this one out?

Thank you in advance!

refseq sequence ortholog R ncbi • 1.2k views
ADD COMMENTlink modified 3.2 years ago • written 3.3 years ago by llltrunkslll0

Hey All,

It's been about 3 weeks and 102 views, so it seems like an interesting topic to people. Could anyone please assist?

ADD REPLYlink written 3.2 years ago by llltrunkslll0
1
gravatar for llltrunkslll
3.2 years ago by
llltrunkslll0 wrote:

well for any poor sod that comes across this post and wanted to know, I solved my own problem by utilizing the BSGenomes package and GenomicFeatures package:

> library(GenomicFeatures) cyno.5.0 <- makeTxDbFromUCSC(genome =
> "macFas5", tablename = "refGene") saveDb(x = cyno.5.0, file =
> "TxDb_cyno_UCSC_5_refGene")

This will build a TxDb object that can be used to get the sequences for translation. A bit convoluted but works for the less used organisms. This could be amended for another species by simply scanning other available builds (uses rtracklayer package loaded as dependency):

ucscGenomes() #scroll through to find relevant organisms
ADD COMMENTlink written 3.2 years ago by llltrunkslll0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1225 users visited in the last hour