What is the best way to get all bacteria proteins from nr?
2.7 years ago
O.rka ▴ 710

I'm trying to figure out the best way to do this. I have the newest taxdump.tar.gz and prot.accession2taxid.gz files from NCBI.

Is there a way to use TaxonKit to get all of the species-level taxonomy identifiers for bacteria and then use them to pull the corresponding proteins out of nr?

protists database nr • 879 views

I am reasonably certain this was asked recently. Have you searched Biostars via Google?

2.7 years ago
Mensur Dlakic ★ 27k

The fastest way I know is not to get them from nr at all. UniProt provides files of all its sequences split by taxonomic division, and updates them regularly.

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

You will need both the sprot and trembl files that end in .dat.gz. esl-reformat, from the Easel tools distributed with the HMMER package, can convert these files into FASTA.
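If esl-reformat is not at hand, the conversion can also be sketched in plain Python. This is a minimal illustration, not what esl-reformat does internally: it assumes the standard UniProt flat-file layout (ID, AC and SQ line codes, entries terminated by //), and the function name and the FASTA header layout (primary accession followed by entry name) are my own choices.

```python
import gzip


def uniprot_dat_to_fasta(dat_gz_path, fasta_path, width=60):
    """Convert a gzipped UniProt flat file to FASTA; return the entry count."""
    count = 0
    with gzip.open(dat_gz_path, "rt") as src, open(fasta_path, "w") as out:
        entry_name, accession, seq, in_seq = "", "", [], False
        for line in src:
            if line.startswith("ID   "):
                entry_name = line.split()[1]
            elif line.startswith("AC   ") and not accession:
                # Keep only the primary (first) accession of the entry.
                accession = line[5:].split(";")[0].strip()
            elif line.startswith("SQ   "):
                # Sequence lines follow the SQ header until the // terminator.
                in_seq = True
            elif line.startswith("//"):
                sequence = "".join(seq)
                out.write(f">{accession} {entry_name}\n")
                for i in range(0, len(sequence), width):
                    out.write(sequence[i:i + width] + "\n")
                count += 1
                entry_name, accession, seq, in_seq = "", "", [], False
            elif in_seq:
                # Residues are printed in space-separated blocks of ten.
                seq.append(line.replace(" ", "").strip())
    return count
```

It would be called once per downloaded file, writing one FASTA file for the sprot division file and one for the trembl division file, which can then be concatenated.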

But if you really want proteins from nr, blastdbcmd can extract them if you have a list of accession numbers. It needs the BLAST-formatted nr database files, indexed by accession number, just as BLAST itself does. I don't think this will be faster than what I described above, because bacterial proteins make up at least half of the database.

blastdbcmd -db nr -dbtype prot -entry_batch protein_list -out proteins.fas -outfmt %f -logfile proteins.log
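The protein_list file for the command above could be built from the two NCBI files mentioned in the question. Here is a minimal Python sketch, assuming nodes.dmp has been extracted from taxdump.tar.gz and that prot.accession2taxid.gz has its usual header line and four tab-separated columns (accession, accession.version, taxid, gi). The function names are hypothetical; 2 is the NCBI taxid for Bacteria.

```python
import gzip
from collections import defaultdict


def descendant_taxids(nodes_path, root_taxid=2):
    """Return root_taxid plus every taxid below it in nodes.dmp."""
    children = defaultdict(list)
    with open(nodes_path) as handle:
        for line in handle:
            # Fields are separated by "\t|\t": tax_id, parent tax_id, rank, ...
            fields = line.split("\t|\t")
            taxid, parent = int(fields[0]), int(fields[1])
            if taxid != parent:  # the root node lists itself as its parent
                children[parent].append(taxid)
    found, stack = {root_taxid}, [root_taxid]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def write_accession_list(acc2taxid_path, taxids, out_path):
    """Write accession.version for each row whose taxid is in taxids."""
    kept = 0
    with gzip.open(acc2taxid_path, "rt") as src, open(out_path, "w") as out:
        next(src, None)  # skip the header line
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2].isdigit() and int(fields[2]) in taxids:
                out.write(fields[1] + "\n")
                kept += 1
    return kept
```

Usage would be along the lines of `write_accession_list("prot.accession2taxid.gz", descendant_taxids("nodes.dmp"), "protein_list")`. Note that prot.accession2taxid.gz is not limited to nr, so the resulting list may contain accessions that the nr database does not have.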

By the way, if an answer solves your problem, please consider accepting it.


One problem with getting the data from UniProt may be the loss of the species names that one gets from the non-redundant WP* entries in nr. It may or may not be critical for the OP.
