Nucleotide CDS from RefSeq?
9.8 years ago
Prohan ▴ 350

Hi All,

I'm trying to retrieve the nucleotide sequences of the complete RefSeq protein CDSs. I've looked at the files at ftp://ftp.ncbi.nih.gov/refseq/release/complete/ but I can't seem to find a file that has both the CDS annotation and the original nucleotide (whole genome) sequence that the CDS came from.

I don't have a problem parsing GenBank files. It just seems odd that there isn't one GenBank file that has all the information I need.

I could add the whole genome sequences to the GenBank files that have the CDS info, but it seems like I'm missing something obvious here.

Here's the general problem I'm trying to solve:

I have a protein with accession "CAA23625" from RefSeq, and I'd like the nucleotide sequence of its CDS. Ideally I'd like to do the parsing locally without having to rely on hitting NCBI's server with an Entrez query. Thanks,

Rohan

ncbi biopython genbank • 3.0k views

What is your question? Perhaps a specific example might help...

9.8 years ago
eddie.im ▴ 140

Get a list of protein accession numbers (you can download them easily from UniProt), then convert them to "EMBL CDS" identifiers through the UniProt "ID mapping" tab. Then run bp_fetch (BioPerl) in a loop. That's how I did it.

while read -r accession_number; do
    bp_fetch "net::embl:${accession_number}"
done < accessions.list > results.txt

Thanks for the info. I'm trying to do it locally by parsing the GenBank files rather than hitting the NCBI/EMBL servers a ton.
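For the local route, here is a minimal sketch of the idea, assuming you have a GenBank flat-file record that contains both the CDS feature (with a /protein_id qualifier) and the ORIGIN sequence. The function names (cds_locations, read_origin, extract, cds_nucleotides) are made up for illustration and use only the standard library. It assumes one record per string, ignores the accession version suffix, and does not handle remote locations such as "ACC:1..10". In practice Biopython's SeqIO.parse(handle, "genbank") together with feature.extract(record.seq) covers the same ground more robustly.

```python
import re

# Translation table for complementing nucleotides (case preserved).
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")

def cds_locations(genbank_text):
    """Map protein accession (version stripped) -> CDS location string."""
    locations = {}
    current = None        # location text of the CDS being read, if any
    in_location = False   # True while the location may continue on the next line
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            break
        key, value = line[5:20].strip(), line[21:].strip()
        if line[:5] == "     " and key:
            # A feature key in columns 6-20 starts a new feature.
            current = value if key == "CDS" else None
            in_location = current is not None
        elif current is not None and value.startswith("/"):
            # First qualifier ends the location; look for the protein_id.
            in_location = False
            match = re.match(r'/protein_id="([^"]+)"', value)
            if match:
                locations[match.group(1).split(".")[0]] = current
        elif in_location:
            current += value  # wrapped location line
    return locations

def read_origin(genbank_text):
    """Return the nucleotide sequence stored in the ORIGIN block."""
    body = genbank_text.split("\nORIGIN", 1)[1].split("\n", 1)[1]
    return re.sub(r"[^A-Za-z]", "", body.split("//", 1)[0])

def extract(genome, location):
    """Apply a location such as complement(join(1..3,7..9)) to genome."""
    location = location.replace("<", "").replace(">", "")
    if location.startswith("complement("):
        inner = extract(genome, location[len("complement("):-1])
        return inner.translate(COMPLEMENT)[::-1]
    if location.startswith("join("):
        parts = location[len("join("):-1].split(",")
        return "".join(extract(genome, part) for part in parts)
    start, _, end = location.partition("..")
    return genome[int(start) - 1:int(end or start)]  # 1-based, inclusive

def cds_nucleotides(genbank_text, protein_accession):
    """Return the CDS nucleotide sequence for one protein accession."""
    location = cds_locations(genbank_text)[protein_accession.split(".")[0]]
    return extract(read_origin(genbank_text), location)
```

Something like cds_nucleotides(open("record.gb").read(), "CAA23625") would then give the CDS, provided that record (a hypothetical file name here) actually carries both the feature and the sequence. For multi-record files, split the text on the "//" terminator lines first.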
