Extract all bacteria sequences from the nr database
2
0
Entering edit mode
7.2 years ago
Penny Liu ▴ 30

Is there a quick way to extract all bacteria sequences from NCBI non-redundant (NR) database using blastdbcmd?

This command applies only to extract all human sequences from the nr database. How about in microbial sequences case?

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt
NCBI nr database bacteria • 11k views
ADD COMMENT
3
Entering edit mode

BLAST+ 2.8.1 with blastdb v5 allow you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build blastdb for specific taxids now !

ADD REPLY
0
Entering edit mode

thank you very much for replies

ADD REPLY
1
Entering edit mode
7.2 years ago

Taxid for human species is 9606. For species of your interest you may find taxid on NCBI website https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

ADD COMMENT
0
Entering edit mode

BLAST Database error: No alias or index file found for nucleotide database [nr] in search path

Can you explain what vital steps I missed out? Thank you again.

Getting all taxids of bacteria (taxid 2)

./blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 2) { print $1 } } ' | ./blastdbcmd -db nr -entry_batch - -out bacteria_sequences.txt
ADD REPLY
1
Entering edit mode

blastdbcmd uses GI (genbank ID) to taxid table which only has taxids in it on a species level. If you want to use blastdbcm to extract all bacterial NR sequences you first need to have a list of all bacterial species taxids. In order to make such list you can go to the NCBI taxid website that I mentioned https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi and go to the FTP download link there ftp://ftp.ncbi.nih.gov/pub/taxonomy/ There you will see the ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme which describes GI to taxid tables mentioned above. There you will see also ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt that describes among other things nodes.dmp file that has a child-parent relationship for the taxid tree.

There are tools like TaxonKit to retrieve all taxids on a certain level for a given taxid from that database, but you might be able to do it yourself faster than installing and learning a new tool.

Once you have a list of species taxids species_taxids.list you can use it in your one-liner by replacing if($2==2) to taxid_list[$2] that you populate from a your species_taxids.list file.

The last question is how to read/write files in awk when it is in the middle of the pipe. You can do it in BEGIN part of the awk or anywhere else:

BEGIN{while (getline < "'"$INPUTFILE"'"){split($0,ft,",");id=ft[1];name=ft[2];key=id;data=name;nameArr[key]=data;}close("'"$INPUTFILE"'");
ADD REPLY
0
Entering edit mode

There are tools like TaxonKit to retrieve all taxids on a certain level for a given taxid from that database

Sorry I did not say it clearly, but TaxonKit retrieves all taxids BELOW a taxid, not ON a certain level.

There maybe some other better solutions. But TaxonKit so easy to install and use, the speed is also very fast. Why not give it try?

Taxonkit provides executable binary files for Linux/OS X/Windows, users just need to to download the tar.gz file, decompress and immediately run. It also provide detailed examples for all the commands, she only need read the example 1 of command taxonkit list, or tutorial.

ADD REPLY
0
Entering edit mode

I guess there's no NR entry with taxid 2, only that of a species has.

ADD REPLY
0
Entering edit mode

Hi, I got same error even use the ($2 == 9606).

./blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' | ./blastdbcmd -db nr -entry_batch - -out human_sequences.txt
ADD REPLY
0
Entering edit mode

How long should this command take to run ? In my case, it is running for a week and it is still not done ? Does that indicate something is wrong ?

ADD REPLY
0
Entering edit mode

Is the file you are capturing the results in growing in size? While it probably should not take that long to run but it may depend on what kind of hardware resources you have access to. If that file is not growing then you should stop the process and re-assess.

ADD REPLY
1
Entering edit mode
7.2 years ago

Extract all protein sequences of specific taxons from the NCBI nr database.

TaxonKit (can be installed via conda: conda install -c bioconda taxonkit) is used to get all taxids belonging to given taxid(s).

Taxid of Bacteria (superkingdom) is 2. The tutorial use virus as example, which has much few taxids. But I did not wait the extracting proccess finished, because blastdbcmd is so slow.

ADD COMMENT
0
Entering edit mode

The page displaying title is extract all protein sequences from the NCBI nr database. How to get DNA sequences rather than protein sequences?

ADD REPLY
1
Entering edit mode

Sorry, NR is the non-redundant protein database. Use NT instead.

ADD REPLY
0
Entering edit mode

TaxonKit worked out perfectly. However, there's something strange here. I already got all taxids of bacteria (taxid 2) and extacted accessions also GIs with csvtk (Please refer to this). Subsequently, I extracting nr sequences as the next step $ blastdbcmd -db nr -entry all -outfmt "%g\t%T" | csvtk -t grep -f 2 -P bacteria.taxid.gi.txt | csvtk -t cut -f 1 | blastdbcmd -db nr -entry_batch - -out nr.bacteria.fa. An error occured:

[ERRO] field (2) out of range (1) in file: -

What am I doing wrong?

By the way, which db should I download? (wget ftp://ftp.ncbi.nih.gov/blast/db/nr.**.tar.gz)

ADD REPLY
0
Entering edit mode

It seems that output of blastdbcmd -db nr -entry all -outfmt "%g\t%T" has only one column. It's strange, but I didn't try it.

You can run the commands separately for debug.

blastdbcmd -db nr -entry all -outfmt "%g\t%T" > nr.gi2tax

And check the nr.gi2tax.

Right, nr locates in ftp://ftp.ncbi.nih.gov/blast/db/nr.**.tar.gz

ADD REPLY
1
Entering edit mode

NCBI has stopped using gi numbers which is probably why you only got one column from the blastdbcmd command.

ADD REPLY
0
Entering edit mode

Right! This may be the point! Use accession instead

blastdbcmd -db nr -entry all -outfmt "%a\t%T" | \
    csvtk -t grep -f 2 -P virus.taxid.acc.txt | \
    csvtk -t cut -f 1 | \
    blastdbcmd -db nr -entry_batch - -out nr.virus.fa
ADD REPLY
0
Entering edit mode

We can also retrieve bacterial sequences directly from nt FASTA sequences and then makeblastdb.

gzip -c -d nt.gz | seqkit grep -f bacteria.taxid.acc.txt > nt.bacteria.fasta

makeblastdb -dbtype nucl -in nt.bacteria.fasta -out nt.bacteria

seqkit is here.

ADD REPLY
0
Entering edit mode

I am researching materials for the Biostar Handbook. It appears that most advice in this thread is not quite right anymore - might have become obsolete.

For example

  1. the nt.gz file does not contain the taxid (it does not have a NCBI formatted header which is super weird), hence seqkit grep wouldn't be able to match it.
  2. On my system the the -outfmt parameter for blastdbcmd does not insert tabs for the \t hence that advice won't work either.
ADD REPLY
0
Entering edit mode

nt blast database does contain taxid's. Instead of the \t a different delimiter could be used (e.g. ,) with the original awk command. I can't get the -entry_batch to work from STDIN. Were you able to?

ADD REPLY
0
Entering edit mode

yes I did get the -entry_batch to work, make sure to put the - there as well.

ADD REPLY
0
Entering edit mode

Hi, shenwei. We could get the protein accession numbers from the prot.accession2taxid.gz, but how to get the accession number of DNA sequence? There are several files under the url: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/, which one shoud be used? Thanks!

ADD REPLY
0
Entering edit mode

nucl_gb.accession2taxid.gz

ADD REPLY

Login before adding your answer.

Traffic: 1709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6