Question: Is there a way to do a batch extraction of DNA Coding Sequences from List of Protein Accessions (NCBI)?
gravatar for biobaker
25 days ago by
biobaker0 wrote:

Hello! I have been banging my head against a wall for the past 2 days trying to get a fasta file of DNA sequences from a list of protein accessions across a broad range of organisms. I have roamed biostars for questions asking the same thing (it should be a common enough task!) but all of the answers I've found don't get me quite what I want.

Here are a few of the protein accessions for reference (I have ~400 in total):


I would like the DNA coding sequence associated with these proteins. Of all the ways I've tried, these seem to be the 2 most promising approaches thus far.

Approach 1: Drawn directly from this answer on a previous post, I have tried to use eutils to link a protein query with the nuccore database. However, as stated in the linked answer, this returns ALL coding sequences from the nucleotide sequence, not just the CDS for the protein in question. For example, for a single protein accession, I would write:

esearch -query "WP_011071901.1" -db protein |elink -target nuccore|efetch -format fasta_cds_na

but this returns all CDS, not just the CDS for WP_011071901.1. Also, I would want to re-work this in a way where I can do a batch extraction of all my proteins, not just 1 at a time.

Approach 2: First find start and stop positions of protein coding sequence in associated nucleotide, then get DNA sequence from there.

I have successfully extracted the nucleotide accession and the start and stop positions already by making use of the code posted here.

Then, I know I can access the nucleotide sequence one at a time by typing

efetch -db nuccore -id NC_004347.2 -format fasta -seq_start 1859348 -seq_stop 1861363

but this is extremely inefficient given the number of proteins I am working with. I can also see this becoming an issue as some of these protein accessions are associated with more than 1 genome/nucleotide. I know which Nucleotide accession I would want to use for each associated protein if that helps for this approach.

Any guidance or help would be appreciated, thank you!

cds eutils protein nuccore ncbi • 198 views
ADD COMMENTlink written 25 days ago by biobaker0

You are working with WP* accessions which are special accessions potentially representing multiple organisms. Where did you get these from? Is there any chance of using non-WP* accessions? Your first solution will work in that case.

ADD REPLYlink modified 25 days ago • written 25 days ago by genomax92k

Unfortunately not, although I DO know which nucleotide records I would want to pull from. Is there a way to specify this in eutils?

ADD REPLYlink written 24 days ago by biobaker0

Are these genome records or individual genes? Can you provide an example?

ADD REPLYlink written 24 days ago by genomax92k

Certainly. These are usually whole genomes or contigs/scaffolds from a genome. For example, WP_011071901.1 is found on both NZ_CP053946.1 and NC_004347.2. These are from 2 different genome assemblies of the same organism. However, I know that I'd just like the DNA sequence encoding WP_011071901.1 from NC_004347.2. Let me know if there's other info I could provide. Thank you!

ADD REPLYlink modified 24 days ago • written 24 days ago by biobaker0

How many genomes do you have? Since you already have the accession numbers and start and stop of each gene, maybe you can download the fasta file for each genome, and use a tool like FragGeneScan [It's really fast], and get the gene sequences, and use grep -A 1 $start to extract the genes you're looking for. (grep -A 1 works with FragGeneScan outputs).

wget -O ${str}.fna "${str}"

 /bin/FragGeneScan1.31/ -genome=NC_004347.2.fna   -out=NC_004347.2-FGS.out  -complete=1 -train=complete thread=9

  grep -A 1 "1859348" NC_004347.2-FGS.ffn 

You can make a bash script to do this these.

Alternatively, you can download both fna and gff file from ncbi and convert the gff to gene sequence, and extract the one you're interested in.

Extract Cds Fastas From A Gff Annotation + Reference Sequence

ADD REPLYlink modified 24 days ago • written 24 days ago by Fatima820

Thank you! I am working with ~160 genomes, so I'd prefer not to download each genome if possible. Also, because I have several proteins per genome (and nucleotide record in some cases), it would also be ideal if I could name each fasta record with the corresponding protein accession, rather than the nucleotide accession and start/stop info. Thank you for both of these options though.

ADD REPLYlink written 24 days ago by biobaker0

it would also be ideal if I could name each fasta record with the corresponding protein accession, rather than the nucleotide accession and start/stop info.

I think you can easily use sed to replace them. Something like this:

sed "s/>$var1/>$var2/g" "$file"
ADD REPLYlink modified 24 days ago • written 24 days ago by Fatima820
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1252 users visited in the last hour