I'm creating a BLAST database using:
makeblastdb -in proteins.fasta -dbtype prot -parse_seqids -out my_protein_db
I tried to extract some sequences from it using blastdbcmd but kept getting "Entry not found" errors.
My entries look like this (there is one pipe in each entry):
ABC|DEF60375.1
EHL|XP_003887.1
However, if I check the identifiers in my database using:
blastdbcmd -entry all -db my_protein_db -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
I get lines like this:
> OID: 0 GI: N/A ACC: ABC|DEF60375.1 IDENTIFIER: gnl|ABC|DEF60375.1
> OID:0 GI: N/A ACC: EHL|XP_003887.1 IDENTIFIER: lcl|EHL|XP_003887.1
So it seems NCBI has added some text plus a pipe in front of my identifiers. I could just prepend these extra letters to my entries when I use blastdbcmd, but I noticed that they are not always the same: in some cases it is "gnl|" and in others it is "lcl|". Does anyone know how NCBI decides on this naming convention, and what is the best way to get around this?
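To make the workaround concrete, here is a rough sketch of what I have in mind: dump the accession and full identifier for every entry once, then look up whichever prefix the database actually stored instead of guessing between "gnl|" and "lcl|". The two sample lines are copied from the listing above; the file variable `id_map` and the helper `lookup_full_id` are names I made up for illustration, and I am assuming (not certain) that blastdbcmd will accept the full identifier as an `-entry` value.

```shell
# Sample "%a %i" lines taken from the listing above. On a real database
# they would come from something like:
#   blastdbcmd -entry all -db my_protein_db -outfmt "%a %i" > id_map.txt
id_map=$(mktemp)
cat > "$id_map" <<'EOF'
ABC|DEF60375.1 gnl|ABC|DEF60375.1
EHL|XP_003887.1 lcl|EHL|XP_003887.1
EOF

# Hypothetical helper: translate a short ID (column 1) into the
# identifier stored in the database (column 2).
lookup_full_id() {
    awk -v want="$1" '$1 == want { print $2 }' "$id_map"
}

full_id=$(lookup_full_id 'EHL|XP_003887.1')
echo "$full_id"

# Assumption: the looked-up identifier can then be used directly, e.g.
#   blastdbcmd -db my_protein_db -entry "$full_id"
```

This avoids hard-coding either prefix, but it does not explain why the two entries were labelled differently in the first place.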
Thanks very much for any input