Question: How to get gene symbols & nucleotide FASTA for taxid 1239
4 weeks ago, anu014 wrote:

Hello Biostars!

I was trying to get gene symbols for taxon 1239 (Firmicutes) from RefSeq protein IDs (e.g. 'WP_020487904.1'), but was unable to do so using bioDBnet.

Even the gene2refseq file doesn't contain taxid 1239 or taxid 1000277 (a species under 1239).

Can anyone tell me how to get all the genes and their respective FASTA sequences for Firmicutes?

Tags: sequence, genome, gene
modified 28 days ago by tdmurphy • written 4 weeks ago by anu014

Do you need gene symbols or gene sequences in FASTA format? Do you need this data for txid1239 or txid1000277?

For example, gene symbol information is included in the gene table, which can be downloaded with the following NCBI Unix E-utilities command.

esearch -db gene -query "txid1239[Organism:exp]" | efetch -format tabular
written 4 weeks ago by Sej Modha • modified 4 weeks ago

I want gene symbols and FASTA sequences for taxon ID 1239, given RefSeq protein IDs as input.

written 4 weeks ago by anu014

You can get the sequences by doing the following:

esearch -db protein -query "txid1239[Organism:exp]" | efetch -format fasta > seq.fa
written 4 weeks ago by genomax
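
If the starting point is a list of RefSeq protein accessions rather than the whole taxon, one option is to pass them to efetch as a comma-separated -id list. The sketch below only builds and prints such a command; the helper name join_ids, the temporary file, and the second accession (WP_000000000.1, a placeholder) are assumptions made for illustration, not part of the thread.

```shell
#!/bin/sh
# join_ids FILE: turn a file with one accession per line into a single
# comma-separated string, skipping blank lines. (Hypothetical helper.)
join_ids() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Illustrative input: the accession from the question plus a placeholder.
ids_file=$(mktemp)
printf 'WP_020487904.1\n\nWP_000000000.1\n' > "$ids_file"

# Print the command for inspection instead of running it.
echo "efetch -db protein -id $(join_ids "$ids_file") -format fasta"

rm -f "$ids_file"
```

For long lists it is probably safer to split the file into smaller batches before building each command.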

I know it's a basic question, but how do I download esearch? It's throwing an error: 'No command 'esearch' found' ...

written 4 weeks ago by anu014

Okay, I got it now. esearch and efetch are part of the EDirect suite, which has to be downloaded and installed first.

written 4 weeks ago by anu014

It's still not working, @GenoMax. After running this, it just shows me the help text of efetch:

EFETCH - retrieve entries from sequence databases.

Synopsis: efetch -options [database:]<query>

Databases: SWissprot/SP, PIR, WOrmpep/WP, EMbl, GEnbank/GB, ProDom, ProSite

Options:
-a            Search with Accession number
-f            Fasta format output
-q            Sequence only output (one line)
-s <#>        Start at position #
-e <#>        Stop at position #
-o            More options and info...

-D <dir>      Specify database directory
-H            Display index header data
-p            Display entrynames in search path
-r            Print sequence in 'raw' format
-m            Fetch from mixed mini database
-M            Mini format output
-b            Do NOT reverse the order of bytes
                          (SunOS, IRIX do reverse, Alpha not)
-d <dbfile>   Specify database file (avoid this)
-i <idxfile>  Specify index file (avoid this)
-l <divfile>  Specify division lookup table (avoid this)
-B <database> Specify database (archaic)
-A            Only return entryname for accession number
-n <name>     Give the sequence this name
-x            Don't require query to match entry's name exactly (avoid)
-w            For Wormpep: also fetch cross-referenced SwissProt entry
-h            shows this help text

Environment:
SWDIR      = SwissProt directory - database and EMBL index files
PIRDIR     = PIR        -- " --
WORMDIR    = Wormpep    -- " --
EMBLDIR    = EMBL       -- " --
GBDIR      = Genbank    -- " --
PRODOMDIR  = ProDom     -- " --
PROSITEDIR = ProSite    -- " --
DBDIR      = User's own -- " -- (fasta format)

SEQDB      database file (default SwissProt)
SEQDBIDX   index file
DIVTABL    division lookup table

Ex. setenv DBDIR /pubseq/seqlibs/embl/

Note that Prodom family consensus seqs can be fetched by PD:_#

by Erik Sonnhammer ( Version 2.1,

written 4 weeks ago by anu014

I am not sure you are using the correct version of the EDirect utilities. Download the latest version of the eutils from: You can also have a look at this blog for more info.

written 4 weeks ago by Sej Modha
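
The help text quoted above belongs to an older, unrelated program that also happens to be called efetch, so a quick check of which efetch the shell is picking up may help. The sketch below is only a heuristic: the function name is_legacy_efetch is made up for illustration, and it simply looks for the banner line shown in this thread.

```shell
#!/bin/sh
# is_legacy_efetch: read help text on stdin and succeed if it carries the
# banner of the older efetch quoted in this thread. (Hypothetical helper.)
is_legacy_efetch() {
    grep -q 'EFETCH - retrieve entries from sequence databases'
}

# Demonstration on the quoted banner line; no efetch needs to be installed.
if printf 'EFETCH - retrieve entries from sequence databases.\n' | is_legacy_efetch; then
    echo "this is not the EDirect efetch"
fi
```

On a real system, `command -v efetch` prints the path of the executable the shell resolves, and `efetch -h 2>&1 | is_legacy_efetch` applies the same check to the installed program.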
28 days ago, tdmurphy wrote:

Many of the bacterial RefSeq genomes aren't available in NCBI's Gene database, so e-utils with the gene db won't work. If you have a specific set of assemblies in mind, try downloading the "feature_table.txt" files for that set and parsing what you need from there. e.g.: Then use the "download assemblies" button to download the "Feature table" file for the RefSeq assemblies. All of Firmicutes is 35k assemblies and a 4.6 GB download.

Your example protein is in this file: The genomic location is in columns 7-10, and the gene symbol (if available) is in column 15. You could then use e-utils to get the FASTA sequence for that genomic range.
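
As a rough sketch of the parsing step, the awk snippet below pulls columns 7-10 and column 15 out of a tab-separated feature table for one protein. It assumes the protein accession sits in column 11 and that coding rows have "CDS" in column 1; the function name lookup_symbol and the sample row (built from values quoted in this thread, with filler in the unused columns) are illustrative only.

```shell
#!/bin/sh
# lookup_symbol ACCESSION FILE
# Print genomic accession, start, end, strand (columns 7-10) and the gene
# symbol (column 15, or "NA" when empty) for matching CDS rows.
# Assumption: the protein accession is in column 11. (Hypothetical helper.)
lookup_symbol() {
    awk -F '\t' -v OFS='\t' -v acc="$1" '
        $1 == "CDS" && $11 == acc {
            sym = ($15 == "" ? "NA" : $15)
            print $7, $8, $9, $10, sym
        }' "$2"
}

# Illustrative one-row table; columns not used here hold filler values.
table=$(mktemp)
printf 'CDS\twith_protein\tGCF_000372005.1\t.\t.\t\tNZ_AQYY01000001.1\t424554\t427151\t+\tWP_020487904.1\t.\t\tATP-dependent chaperone ClpB\tclpB\n' > "$table"

lookup_symbol WP_020487904.1 "$table"

rm -f "$table"
```

Since only a minority of genes carry a symbol, printing "NA" for an empty column 15 keeps the output columns aligned for downstream tools.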

If you want CDS nucleotide sequence (same as the gene sequence), with gene symbols in the FASTA headers, try the "CDS from genomic" file from that same download option (31.8 GB). Your example has a header like this:


To do that for individual proteins via e-utils, you could use something like:

# first use the IPG report to get the nucleotide accession and location
esearch -db protein -query WP_020487904 | esummary -format ipg | grep WP_020487904
41115784    RefSeq  NZ_AQYY01000001.1   424554  427151  +   WP_020487904.1  ATP-dependent chaperone ClpB    Dehalobacter sp. FTH1   FTH1    GCF_000372005.1

# then use that location from columns 3-6 to get the sequence:
efetch -db nuccore -id NZ_AQYY01000001.1 -seq_start 424554 -seq_stop 427151 -strand plus -format fasta_cds_na
>lcl|NZ_AQYY01000001.1_cds_WP_020487904.1_1 [gene=clpB] [locus_tag=A37G_RS0101875] [protein=ATP-dependent chaperone ClpB] [protein_id=WP_020487904.1] [location=424554..427151] [gbkey=CDS]

Keep in mind a single WP may be found on multiple assemblies (or even at multiple locations of the same assembly), so the IPG report may have multiple rows for the same WP accession.
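
Because the IPG report can contain several rows per WP accession, it may be convenient to turn each row into its own efetch call. The sketch below does that with awk and only prints the commands. It assumes the report is tab-separated with the accession, start, stop and strand in columns 3-6, and that efetch accepts "minus" as the counterpart of the "plus" value used above; the function name ipg_to_efetch is made up for illustration.

```shell
#!/bin/sh
# ipg_to_efetch: read IPG report rows (tab-separated) on stdin and print one
# efetch command per row, without running it. (Hypothetical helper.)
ipg_to_efetch() {
    awk -F '\t' '
        NF >= 6 {
            # Assumption: "-" in column 6 maps to "-strand minus".
            strand = ($6 == "-") ? "minus" : "plus"
            printf "efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta_cds_na\n", $3, $4, $5, strand
        }'
}

# The row quoted above, fed in as sample input.
printf '41115784\tRefSeq\tNZ_AQYY01000001.1\t424554\t427151\t+\tWP_020487904.1\tATP-dependent chaperone ClpB\tDehalobacter sp. FTH1\tFTH1\tGCF_000372005.1\n' | ipg_to_efetch
```

Printing the commands first makes it easy to review them, or to keep only the rows for the assembly of interest, before piping the result to sh.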

Note that only about 10% of the genes for that assembly have gene symbols assigned. Protein names on WPs are better defined than gene symbols.

written 28 days ago by tdmurphy