I want to build a BLAST database of insect proteins so that I can BLAST my transcriptome assembly locally. I downloaded all the accession numbers associated with insects from the NCBI website. Next, I used this command to retrieve the corresponding FASTA sequences from my locally installed NCBI nr database:
blastdbcmd -db /home/db/ncbi/nr -entry_batch protein_result.txt -out insects_seq.fa
This, however, gives me incomplete output: a lot of accession numbers were not found, e.g. Error: CAB42201.1: OID not found
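To see exactly which accessions are affected, the error messages could be collected into a list. This is only a minimal sketch: it assumes the messages always look like the one above and that the blastdbcmd error output was saved to a text file whose contents are passed in as a string. The function name missing_accessions is made up for illustration.

```python
import re

# Matches lines of the form "Error: CAB42201.1: OID not found"
# (format assumed from the single message quoted above).
NOT_FOUND = re.compile(r"^Error:\s*(\S+?):\s*OID not found")


def missing_accessions(log_text):
    """Return the accessions reported as 'OID not found', in log order."""
    missing = []
    for line in log_text.splitlines():
        match = NOT_FOUND.match(line.strip())
        if match:
            missing.append(match.group(1))
    return missing
```

The resulting list could then be compared against protein_result.txt to see how much of the download is actually missing from the local nr copy.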
Moreover, I get a lot of multi-header entries in the output file, e.g.
>gi|1080121958|gb|AOW70003.1| arginine kinase, partial [Remella rita] >gi|1080122062|gb|AOW70055.1| arginine kinase, partial [Xenophanes tryxus]
EEKVSSTLSGLEGELKGTFYPLTGMSKQTQQQLIDDHFLFKEGDRFLQAANACRFWPTGRGIYHNENKTFLVWCNEEDHL
RLISMQMGGDLKTVYKRLVTAVNDIEKRIPFSHNDRLGFLTFCPTNLGTTVRASVHIKLPKLAADKAKLEEVASKYHLQV
RGTRGEHTEAEGGVYDISNKRRMGLTEYDAVKEMYDG
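As a stopgap, such merged entries could be split in post-processing so that every header gets its own record with a copy of the shared sequence. The sketch below is only an illustration under two assumptions: merged headers are always joined by " >" as in the example above, and no protein title itself contains " >". The function names are hypothetical.

```python
def expand_header(header, sequence):
    """Yield one (defline, sequence) pair per header in a merged defline."""
    # Assumption: merged headers are separated by " >".
    for defline in header.split(" >"):
        yield defline.strip(), sequence


def split_merged_records(fasta_text):
    """Yield (defline, sequence) pairs, one for every header in the input."""
    header, seq_lines = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield from expand_header(header, "".join(seq_lines))
            header, seq_lines = line[1:], []
        else:
            seq_lines.append(line)
    # Emit the final record.
    if header is not None:
        yield from expand_header(header, "".join(seq_lines))
```

Each pair could then be written back out as ">defline" followed by the sequence. Note that this duplicates identical sequences, so it only makes sense if a separate entry per species is really needed.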
Is there a way to avoid both issues?
Thanks a lot in advance! Janne