Question: How to get gene symbols & nucleotide FASTA for taxid 1239
anu014 (India) wrote 7 months ago:

Hello Biostars!

I was trying to get gene symbols for taxon 1239 (Firmicutes) from RefSeq protein IDs, but was unable to do so using bioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php). An example ID is 'WP_020487904.1' (https://www.ncbi.nlm.nih.gov/protein/521976633/).

Even the gene2refseq file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz) doesn't contain taxid 1239 or taxid 1000277 (a species under 1239).

Can anyone tell me how to get all the genes and their respective FASTA sequences for Firmicutes?


Do you need gene symbols or gene sequences in FASTA format? Do you need this data for txid1239 or txid1000277?

For example, gene symbol info is included in the gene table, which can be downloaded using the following NCBI EDirect (Unix e-utils) command.

esearch -db gene -query "txid1239[Organism:exp] "|efetch -format tabular
Reply written 7 months ago by Sej Modha

I want gene symbols and FASTA sequences, given RefSeq protein IDs for taxon ID 1239 as input.

Reply written 7 months ago by anu014

You can get the protein sequences for the taxon as follows:

esearch -db protein -query "txid1239[Organism:exp] "| efetch -format fasta > seq.fa
Reply written 7 months ago by genomax

I know it's a basic question, but how do I install esearch? It throws the error: "No command 'esearch' found".

Reply written 6 months ago by anu014

Okay, I got it now. The EDirect suite can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/ and contains the esearch and efetch programs.

Reply written 6 months ago by anu014

It's still not working, @GenoMax. When I run the command, it just shows me the help text of efetch:

EFETCH - retrieve entries from sequence databases.

Synopsis: efetch -options [database:]<query>

Databases: SWissprot/SP, PIR, WOrmpep/WP, EMbl, GEnbank/GB, ProDom, ProSite

Options:
-a            Search with Accession number
-f            Fasta format output
-q            Sequence only output (one line)
-s <#>        Start at position #
-e <#>        Stop at position #
-o            More options and info...

-D <dir>      Specify database directory
-H            Display index header data
-p            Display entrynames in search path
-r            Print sequence in 'raw' format
-m            Fetch from mixed mini database
-M            Mini format output
-b            Do NOT reverse the order of bytes
                          (SunOS, IRIX do reverse, Alpha not)
-d <dbfile>   Specify database file (avoid this)
-i <idxfile>  Specify index file (avoid this)
-l <divfile>  Specify division lookup table (avoid this)
-B <database> Specify database (archaic)
-A            Only return entryname for accession number
-n <name>     Give the sequence this name
-x            Don't require query to match entry's name exactly (avoid)
-w            For Wormpep: also fetch cross-referenced SwissProt entry
-h            shows this help text

Environment:
SWDIR      = SwissProt directory - database and EMBL index files
PIRDIR     = PIR        -- " --
WORMDIR    = Wormpep    -- " --
EMBLDIR    = EMBL       -- " --
GBDIR      = Genbank    -- " --
PRODOMDIR  = ProDom     -- " --
PROSITEDIR = ProSite    -- " --
DBDIR      = User's own -- " -- (fasta format)

SEQDB      database file (default SwissProt)
SEQDBIDX   index file
DIVTABL    division lookup table

Ex. setenv DBDIR /pubseq/seqlibs/embl/

Note that Prodom family consensus seqs can be fetched by PD:_#

by Erik Sonnhammer (esr@sanger.ac.uk) Version 2.1,

Reply written 6 months ago by anu014

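I am not sure you are running the right program. The help text above belongs to a different tool that is also named efetch (by Erik Sonnhammer), not the efetch from NCBI's EDirect, so it is probably the one being found first on your PATH. Download the latest version of EDirect from: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/versions/current. You can also have a look at this blog for more info.

Reply written 6 months ago by Sej Modha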
tdmurphy wrote 6 months ago:

Many bacterial RefSeq genomes aren't available in NCBI's Gene database, so e-utils with the gene db won't work for them. If you have a specific set of assemblies in mind, try downloading the "feature_table.txt" files for that set and parsing what you need from there, e.g.: https://www.ncbi.nlm.nih.gov/assembly/?term=txid1239%5Borgn%5D+latest_refseq%5Bfilter%5D Then use the "download assemblies" button to download the "Feature table" file for the RefSeq assemblies. All of Firmicutes is 35k assemblies and a 4.6 GB download.

Your example protein is in this file: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/372/005/GCF_000372005.1_ASM37200v1/GCF_000372005.1_ASM37200v1_feature_table.txt.gz The genomic location is in columns 7-10, and the gene symbol (if available) is in column 15. You could then use e-utils to get the FASTA sequence for that genomic range.
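The column lookup described above can be sketched with awk. This is only a minimal sketch: it assumes the feature table is tab-separated with a header line starting with "#", and it takes the location from columns 7-10 and the symbol from column 15 as described above. The function name `feature_lookup` is hypothetical, and matching the accession against every field (rather than a fixed column) is an assumption made to avoid hard-coding its position.

```shell
# Hypothetical helper: look up one protein accession in a feature table
# read from stdin and print genomic accession, start, end, strand and
# gene symbol, tab-separated.
feature_lookup() {
  awk -F '\t' -v acc="$1" '
    /^#/ { next }                        # skip the header line
    {
      for (i = 1; i <= NF; i++) {
        if ($i == acc) {
          sym = ($15 == "" ? "NA" : $15) # "NA" when no symbol is assigned
          print $7 "\t" $8 "\t" $9 "\t" $10 "\t" sym
          break                          # report each matching row once
        }
      }
    }'
}

# Usage, with the file named in the thread:
# gunzip -c GCF_000372005.1_ASM37200v1_feature_table.txt.gz | feature_lookup WP_020487904.1
```

Because the same WP accession can be annotated at several places, the function may print more than one line per accession.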

If you want the CDS nucleotide sequence (for a bacterial protein-coding gene, effectively the gene sequence), with gene symbols in the FASTA headers, try the "CDS from genomic" file from that same download option (31.8 GB). Your example has a header like this:

>lcl|NZ_AQYY01000001.1_cds_WP_020487904.1_359 [gene=clpB] [locus_tag=A37G_RS0101875] [protein=ATP-dependent chaperone ClpB] [protein_id=WP_020487904.1] [location=424554..427151] [gbkey=CDS]
ATGGACACCGACAAGCTGACGACCCGCAGCCGGGACGCGGTCTCGGCCGCCCTGCGCACCGCTCTGACGAAAGGCAACCC
GGCGGCCGAGCCGGTGCACCTGCTGTACGCGTTGCTGCTGGTCCCCGACAACACGGTCGCGCCCCTGCTGGGCTCGATCG
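The `[gene=...]` and `[protein_id=...]` tags in headers like the one above can be pulled into a two-column table of protein accession and gene symbol. A minimal sketch, assuming every header uses the bracketed `key=value` layout shown; `cds_header_symbols` is a hypothetical name, not part of any NCBI tool.

```shell
# Hypothetical helper: read a "CDS from genomic" FASTA on stdin and print
# protein_id <TAB> gene symbol for every header line. A missing tag is
# reported as "NA" (most CDS in this assembly have no gene symbol).
cds_header_symbols() {
  awk '
    # Return the value of [key=value] in a header line, or "NA" if absent.
    function tag(line, key,    i, rest) {
      i = index(line, "[" key "=")
      if (i == 0) return "NA"
      rest = substr(line, i + length(key) + 2)
      return substr(rest, 1, index(rest, "]") - 1)
    }
    /^>/ { print tag($0, "protein_id") "\t" tag($0, "gene") }'
}

# Usage (the file name is a placeholder for the downloaded file):
# gunzip -c cds_from_genomic.fna.gz | cds_header_symbols > wp_to_symbol.tsv
```

The resulting table can then be joined against a list of RefSeq protein IDs with standard tools such as `grep -F -f` or `join`.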

To get the same CDS nucleotide sequence for individual proteins via e-utils, you could use something like:

# first use the IPG report to get the nucleotide accession and location
esearch -db protein -query WP_020487904 | esummary -format ipg | grep WP_020487904
41115784    RefSeq  NZ_AQYY01000001.1   424554  427151  +   WP_020487904.1  ATP-dependent chaperone ClpB    Dehalobacter sp. FTH1   FTH1    GCF_000372005.1

# then use that location from columns 3-6 to get the sequence:
efetch -db nuccore -id NZ_AQYY01000001.1 -seq_start 424554 -seq_stop 427151 -strand plus -format fasta_cds_na
>lcl|NZ_AQYY01000001.1_cds_WP_020487904.1_1 [gene=clpB] [locus_tag=A37G_RS0101875] [protein=ATP-dependent chaperone ClpB] [protein_id=WP_020487904.1] [location=424554..427151] [gbkey=CDS]
ATGGACACCGACAAGCTGACGACCCGCAGCCGGGACGCGGTCTCGGCCGCCCTGCGCACCGCTCTGACGA

Keep in mind a single WP may be found on multiple assemblies (or even at multiple locations of the same assembly), so the IPG report may have multiple rows for the same WP accession.
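Since one WP accession can map to several rows, the two steps above can be chained so that every row yields its own efetch call. The sketch below only prints the commands so they can be reviewed before running. It assumes the IPG report is tab-separated with the column order shown above (nucleotide accession, start, stop and strand in columns 3-6, protein accession in column 7). The name `ipg_to_efetch` is hypothetical, and mapping "-" to `-strand minus` is an assumption that mirrors the `-strand plus` usage above.

```shell
# Hypothetical helper: turn IPG report rows (tab-separated, on stdin) into
# one efetch command per genomic location of the given protein accession.
# Commands are printed, not executed.
ipg_to_efetch() {
  awk -F '\t' -v acc="$1" '
    $7 == acc && $3 != "" {
      strand = ($6 == "-" ? "minus" : "plus")
      printf "efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta_cds_na\n", $3, $4, $5, strand
    }'
}

# Usage, reusing the esearch/esummary pipeline from above:
# esearch -db protein -query WP_020487904 | esummary -format ipg | ipg_to_efetch WP_020487904.1
```

Piping the printed lines to `sh` would run them; for large ID lists it is worth checking NCBI's request-rate limits first.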

Note only about 10% of the genes for that assembly have gene symbols assigned. Protein names on WPs are better defined than gene symbols.
