Question

Nucleotide CDS from RefSeq?

0

9.8 years ago

Prohan ▴ 350

Hi All,

I'm trying to retrieve the nucleotide sequences of the complete RefSeq protein CDSs. I've looked at the files at ftp://ftp.ncbi.nih.gov/refseq/release/complete/ but I can't seem to find a file that has both the CDS annotation and the original nucleotide (whole genome) sequence that the CDS came from.

I don't have a problem parsing GenBank files. It just seems odd that there isn't one GenBank file that has the information I need.

I could add the whole genome sequences to the GenBank files that have the CDS info, but it seems like I'm missing something obvious here.

Here's the general problem I'm trying to solve:

I have a protein with accession "CAA23625" from RefSeq, and I'd like the nucleotide sequence of its CDS. Ideally I'd like to do the parsing locally without having to rely on hitting NCBI's server with an Entrez query. Thanks,

Rohan

ncbi biopython genbank • 3.0k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Prohan ▴ 350

1

What is your question? Perhaps a specific example might help...

ADD REPLY • link 9.8 years ago by Peter 6.0k

Ram · Answer 1 · 2014-06-24

0

9.8 years ago

eddie.im ▴ 140

Get a list of protein accession numbers (in UniProt you can download them easily), then convert those to "EMBL CDS" through the UniProt "ID mapping" tab. Then use bp_fetch (BioPerl) in a loop. That's how I did it.

# Fetch the EMBL record for each accession listed (one per line) in accessions.list
while read -r accession_number; do
    bp_fetch "net::embl:${accession_number}"
done < accessions.list > results.txt

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by eddie.im ▴ 140

0

Thanks for the info. I'm trying to do it locally by parsing the GenBank files rather than hitting the NCBI/EMBL servers a ton.

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Prohan ▴ 350
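
For the local-parsing route discussed above, here is a minimal sketch in plain Python of pulling the CDS nucleotide sequence for each /protein_id out of a single GenBank flat-file record. It assumes the record contains both the feature table and the ORIGIN sequence, which may not hold for the RefSeq files in question. The function names (parse_location, revcomp, cds_by_protein_id) are made up for illustration. It only handles simple locations, join(...), and an outer complement(...); locations that reference other accessions or nest complement inside join are not covered. Biopython's GenBank parser would be the more robust choice in practice.

```python
import re


def parse_location(loc):
    """Return ([(start, end), ...] as 1-based inclusive spans, is_complement)."""
    complement = loc.startswith("complement(")
    spans = [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", loc)]
    return spans, complement


def revcomp(seq):
    """Reverse-complement a DNA string (only a/c/g/t are swapped)."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def cds_by_protein_id(genbank_text):
    """Map each CDS /protein_id in one GenBank record to its nucleotide sequence."""
    # Sequence lines sit between the ORIGIN line and the terminating "//".
    origin = genbank_text.split("\nORIGIN", 1)[1].split("\n//")[0]
    genome = re.sub(r"[^a-zA-Z]", "", origin.split("\n", 1)[1])

    # Feature table sits between the FEATURES header line and ORIGIN.
    features = genbank_text.split("\nFEATURES", 1)[1].split("\nORIGIN", 1)[0]

    result = {}
    key, loc, in_loc = None, "", False
    for line in features.splitlines()[1:]:
        if line[:21].strip():
            # A new feature: key in the first 21 columns, location after it.
            key, loc, in_loc = line[:21].strip(), line[21:].strip(), True
        elif line[21:].startswith("/"):
            # First qualifier ends the (possibly multi-line) location.
            in_loc = False
            m = re.match(r'/protein_id="([^"]+)"', line[21:].strip())
            if key == "CDS" and m:
                spans, complement = parse_location(loc)
                seq = "".join(genome[s - 1:e] for s, e in spans)
                result[m.group(1)] = revcomp(seq) if complement else seq
        elif in_loc:
            # Continuation of a location that wraps across lines.
            loc += line.strip()
    return result
```

Keys are stored exactly as they appear in the file, so they normally carry a version suffix (an accession followed by ".1", for example). A file holding many records would need to be split on the "//" terminator lines first, with each record passed to cds_by_protein_id separately.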