Question: Extract all bacteria sequences from the nr database
0
gravatar for Yi-Ting Liu
11 months ago by
Yi-Ting Liu30
Taiwan
Yi-Ting Liu30 wrote:

Is there a quick way to extract all bacteria sequences from NCBI non-redundant (NR) database using blastdbcmd?

This command applies only to extract all human sequences from the nr database. How about in microbial sequences case?

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt
bacteria nr database ncbi • 1.7k views
ADD COMMENTlink modified 5 months ago by jgl40 • written 11 months ago by Yi-Ting Liu30

thank you very much for replies

ADD REPLYlink written 5 months ago by jgl40
1
gravatar for shenwei356
11 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

Extract all protein sequences of specific taxons from the NCBI nr database.

TaxonKit (can be installed via conda: conda install -c bioconda taxonkit) is used to get all taxids belonging to given taxid(s).

Taxid of Bacteria (superkingdom) is 2. The tutorial use virus as example, which has much few taxids. But I did not wait the extracting proccess finished, because blastdbcmd is so slow.

ADD COMMENTlink modified 11 months ago • written 11 months ago by shenwei3563.4k

The page displaying title is extract all protein sequences from the NCBI nr database. How to get DNA sequences rather than protein sequences?

ADD REPLYlink written 11 months ago by Yi-Ting Liu30
1

Sorry, NR is the non-redundant protein database. Use NT instead.

ADD REPLYlink modified 11 months ago • written 11 months ago by shenwei3563.4k

TaxonKit worked out perfectly. However, there's something strange here. I already got all taxids of bacteria (taxid 2) and extacted accessions also GIs with csvtk (Please refer to this). Subsequently, I extracting nr sequences as the next step $ blastdbcmd -db nr -entry all -outfmt "%g\t%T" | csvtk -t grep -f 2 -P bacteria.taxid.gi.txt | csvtk -t cut -f 1 | blastdbcmd -db nr -entry_batch - -out nr.bacteria.fa. An error occured:

[ERRO] field (2) out of range (1) in file: -

What am I doing wrong?

By the way, which db should I download? (wget ftp://ftp.ncbi.nih.gov/blast/db/nr.**.tar.gz)

ADD REPLYlink modified 10 months ago • written 10 months ago by Yi-Ting Liu30

It seems that output of blastdbcmd -db nr -entry all -outfmt "%g\t%T" has only one column. It's strange, but I didn't try it.

You can run the commands separately for debug.

blastdbcmd -db nr -entry all -outfmt "%g\t%T" > nr.gi2tax

And check the nr.gi2tax.

Right, nr locates in ftp://ftp.ncbi.nih.gov/blast/db/nr.**.tar.gz

ADD REPLYlink written 10 months ago by shenwei3563.4k
1

NCBI has stopped using gi numbers which is probably why you only got one column from the blastdbcmd command.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax40k

Right! This may be the point! Use accession instead

blastdbcmd -db nr -entry all -outfmt "%a\t%T" | \
    csvtk -t grep -f 2 -P virus.taxid.acc.txt | \
    csvtk -t cut -f 1 | \
    blastdbcmd -db nr -entry_batch - -out nr.virus.fa
ADD REPLYlink written 10 months ago by shenwei3563.4k

We can also retrieve bacterial sequences directly from nt FASTA sequences and then makeblastdb.

gzip -c -d nt.gz | seqkit grep -f bacteria.taxid.acc.txt > nt.bacteria.fasta

makeblastdb -dbtype nucl -in nt.bacteria.fasta -out nt.bacteria

seqkit is here.

ADD REPLYlink modified 10 months ago • written 10 months ago by shenwei3563.4k

I am researching materials for the Biostar Handbook. It appears that most advice in this thread is not quite right anymore - might have become obsolete.

For example

  1. the nt.gz file does not contain the taxid (it does not have a NCBI formatted header which is super weird), hence seqkit grep wouldn't be able to match it.
  2. On my system the the -outfmt parameter for blastdbcmd does not insert tabs for the \t hence that advice won't work either.
ADD REPLYlink written 6 months ago by Istvan Albert ♦♦ 75k

nt blast database does contain taxid's. Instead of the \t a different delimiter could be used (e.g. ,) with the original awk command. I can't get the -entry_batch to work from STDIN. Were you able to?

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax40k

yes I did get the -entry_batch to work, make sure to put the - there as well.

ADD REPLYlink written 6 months ago by Istvan Albert ♦♦ 75k

Hi, shenwei. We could get the protein accession numbers from the prot.accession2taxid.gz, but how to get the accession number of DNA sequence? There are several files under the url: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/, which one shoud be used? Thanks!

ADD REPLYlink written 7 weeks ago by niuyw0

nucl_gb.accession2taxid.gz

ADD REPLYlink written 7 weeks ago by shenwei3563.4k
0
gravatar for Petr Ponomarenko
11 months ago by
United States / Los Angeles / ALAPY.com
Petr Ponomarenko2.4k wrote:

Taxid for human species is 9606. For species of your interest you may find taxid on NCBI website https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

ADD COMMENTlink written 11 months ago by Petr Ponomarenko2.4k

BLAST Database error: No alias or index file found for nucleotide database [nr] in search path

Can you explain what vital steps I missed out? Thank you again.

Getting all taxids of bacteria (taxid 2)

./blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 2) { print $1 } } ' | ./blastdbcmd -db nr -entry_batch - -out bacteria_sequences.txt
ADD REPLYlink modified 11 months ago • written 11 months ago by Yi-Ting Liu30
1

blastdbcmd uses GI (genbank ID) to taxid table which only has taxids in it on a species level. If you want to use blastdbcm to extract all bacterial NR sequences you first need to have a list of all bacterial species taxids. In order to make such list you can go to the NCBI taxid website that I mentioned https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi and go to the FTP download link there ftp://ftp.ncbi.nih.gov/pub/taxonomy/ There you will see the ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme which describes GI to taxid tables mentioned above. There you will see also ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt that describes among other things nodes.dmp file that has a child-parent relationship for the taxid tree.

There are tools like TaxonKit to retrieve all taxids on a certain level for a given taxid from that database, but you might be able to do it yourself faster than installing and learning a new tool.

Once you have a list of species taxids species_taxids.list you can use it in your one-liner by replacing if($2==2) to taxid_list[$2] that you populate from a your species_taxids.list file.

The last question is how to read/write files in awk when it is in the middle of the pipe. You can do it in BEGIN part of the awk or anywhere else:

BEGIN{while (getline < "'"$INPUTFILE"'"){split($0,ft,",");id=ft[1];name=ft[2];key=id;data=name;nameArr[key]=data;}close("'"$INPUTFILE"'");
ADD REPLYlink modified 11 months ago • written 11 months ago by Petr Ponomarenko2.4k

There are tools like TaxonKit to retrieve all taxids on a certain level for a given taxid from that database

Sorry I did not say it clearly, but TaxonKit retrieves all taxids BELOW a taxid, not ON a certain level.

There maybe some other better solutions. But TaxonKit so easy to install and use, the speed is also very fast. Why not give it try?

Taxonkit provides executable binary files for Linux/OS X/Windows, users just need to to download the tar.gz file, decompress and immediately run. It also provide detailed examples for all the commands, she only need read the example 1 of command taxonkit list, or tutorial.

ADD REPLYlink written 11 months ago by shenwei3563.4k

I guess there's no NR entry with taxid 2, only that of a species has.

ADD REPLYlink written 11 months ago by shenwei3563.4k

Hi, I got same error even use the ($2 == 9606).

./blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' | ./blastdbcmd -db nr -entry_batch - -out human_sequences.txt
ADD REPLYlink written 11 months ago by Yi-Ting Liu30

How long should this command take to run ? In my case, it is running for a week and it is still not done ? Does that indicate something is wrong ?

ADD REPLYlink written 4 months ago by ambarishnag0

Is the file you are capturing the results in growing in size? While it probably should not take that long to run but it may depend on what kind of hardware resources you have access to. If that file is not growing then you should stop the process and re-assess.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax40k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1244 users visited in the last hour