Hi all!
I am downloading the nt database with update_blastdb --decompress nt
Now I would like to restrict the database to bacteria (taxid 2).
What is the fastest way to do that?
PS. I checked similar posts, but only found a solution for the nr database.
Best, Agata
The following should work. I tested it with a different taxID (9925, not 2), so replace 9925 with 2. You will need the BLAST index files for nt.
blastdbcmd -db /path_to/nt -outfmt "%T %a" -entry all | awk '$1 == "9925" {print $2}' | xargs -n 1 sh -c 'blastdbcmd -db /path_to/nt -outfmt "%f" -entry "$0"' > bacteria_nt.fa
This will take a while. No way around it.
You could save the accession numbers you need by doing this:
blastdbcmd -db /path_to/nt -outfmt "%T %a" -entry all | awk '$1 == "9925" {print $2}' > acc_bact
and then extract the sequences from the nt FASTA file you have using faSomeRecords from the Kent utilities. I don't know whether that would be any faster.
EDIT 10/09/2019: This idea does not work for top-level taxIDs (e.g. 2 for Bacteria or 2759 for Eukaryota), since the nt sequences are not annotated at that level.
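If you do save the accessions to a file first, blastdbcmd can read that file directly through its -entry_batch option, which avoids starting one blastdbcmd process per accession as the xargs version does. A minimal sketch, assuming the list holds one accession per line; the function name is only illustrative:

```shell
# Extract every sequence named in an accession list with a single blastdbcmd call.
# $1: path to the BLAST database (e.g. /path_to/nt)
# $2: file with one accession per line (e.g. acc_bact)
# $3: output FASTA file
extract_by_accession_list() {
    blastdbcmd -db "$1" -entry_batch "$2" -outfmt "%f" > "$3"
}

# Example call (paths are placeholders):
# extract_by_accession_list /path_to/nt acc_bact bacteria_nt.fa
```

I have not timed this against faSomeRecords, so treat it as something to try rather than a guaranteed speed-up.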
I used this tutorial, but with the nt database instead of nr: https://bioinf.shenwei.me/taxonkit/tutorial/
The FASTA file was downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
Taxonomy: nucl_gb.accession2taxid
Taxonomy ID: 2
Best, Agata
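For anyone following the same route, here is a rough sketch of the two filtering steps in plain awk rather than any dedicated tool. It assumes you already have a file listing taxid 2 and all of its descendants, one per line (for example from taxonkit list --ids 2 --indent ""), that nucl_gb.accession2taxid is uncompressed and tab-separated with a header row and the columns accession, accession.version, taxid, gi, and that each FASTA header starts with the accession.version. The function and file names are made up for illustration.

```shell
# Print accession.version for every accession2taxid row whose taxid is in the list.
# $1: taxid list (one per line), $2: accession2taxid table
accessions_for_taxids() {
    awk -F'\t' '
        NR == FNR { keep[$1]; next }            # first file: remember the taxids
        FNR > 1 && ($3 in keep) { print $2 }    # second file: skip header, test taxid
    ' "$1" "$2"
}

# Keep only the FASTA records whose first header token is in the accession list.
# $1: accession list (one per line); FASTA is read from standard input
subset_fasta() {
    awk '
        NR == FNR { keep[$1]; next }                          # load accessions
        /^>/ { id = substr($1, 2); printing = (id in keep) }  # new record: keep it?
        printing                                              # print kept lines
    ' "$1" -
}

# Example calls (file names are placeholders):
# accessions_for_taxids bacteria.taxid.txt nucl_gb.accession2taxid > bacteria.acc.txt
# gzip -dc nt.gz | subset_fasta bacteria.acc.txt > bacteria_nt.fa
```

Both steps hold the lookup list in memory, so the accession list for all bacteria may need a machine with plenty of RAM. Only the first ID on each header line is checked.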
What's the end goal? Are you planning on having a local copy of NR and a bacterial DB?
If so, you might be better off just restricting your BLAST searches with GI/accession lists when querying the full database.
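As a sketch of what that restriction could look like: blastn accepts an accession list through -seqidlist (and a GI list through -gilist), so the full nt database stays untouched and only the listed sequences are searched. The function name, the tabular output format and the file names below are my own choices, not anything prescribed in this thread.

```shell
# Search the full database but restrict the search to the listed accessions.
# $1: query FASTA, $2: BLAST database, $3: accession list, $4: output file
blast_restricted_to_accessions() {
    blastn -query "$1" -db "$2" -seqidlist "$3" -outfmt 6 -out "$4"
}

# Example call (file names are placeholders):
# blast_restricted_to_accessions query.fa /path_to/nt acc_bact hits.tsv
```

Recent BLAST+ releases with version 5 databases also accept a list of taxids via -taxidlist, which may spare you building the accession list at all; check blastn -help on your installation to see whether it is available.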