Question: extracting sequences from a blast database
15 months ago by
max_19150 wrote:

Hi all!

I'm creating a blast database using:

makeblastdb -in proteins.fasta -dbtype prot -parse_seqids -out my_protein_db

I was trying to extract some sequences from this using blastdbcmd but kept getting error messages of "Entry not found".

My entries look like this (there is one pipe in each entry): ABC|DEF60375.1 EHL|XP_003887.1

However, if I check the identifiers in my database using:

blastdbcmd -entry all -db my_protein_db -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"

I get lines like this:

> OID: 0 GI: N/A ACC: ABC|DEF60375.1 IDENTIFIER: gnl|ABC|DEF60375.1 

> OID: 0 GI: N/A ACC: EHL|XP_003887.1 IDENTIFIER: lcl|EHL|XP_003887.1

So it seems NCBI has added some text plus a pipe in front of my identifiers. I could just prepend these extra letters to my entries when I use blastdbcmd, but I noticed that they are not always the same: in some cases it is "gnl|" and in others it is "lcl|". Does anyone know how NCBI decides on this naming convention, and what is the best way to get around it?

Thanks very much for any input

sequencing blast protein genome
modified 15 months ago by genomax87k • written 15 months ago by max_19150

What do the FASTA headers in your proteins.fasta look like? Can you post the output of grep "^>" proteins.fasta | head -3?

written 15 months ago by genomax87k

like this:

modified 14 months ago • written 15 months ago by max_19150

Which version of blast are you using?

See this page for additional detail.

Those are NCBI standard fasta identifiers.
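
One way to avoid guessing whether a given entry was stored with "gnl|" or "lcl|" is to read the full identifiers back out of the database and look them up by the bare ID. The sketch below assumes the listing is produced with the same %a and %i format specifiers used in the question. resolve_ids, listing.txt, wanted_ids.txt and full_ids.txt are hypothetical names, not part of BLAST+, and whether blastdbcmd then accepts the full identifiers may depend on your BLAST+ version.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_ids is a hypothetical helper, not part of BLAST+.

# resolve_ids LISTING WANTED
#   LISTING: lines of "<accession> <full identifier>", e.g. written by
#            blastdbcmd -db my_protein_db -entry all -outfmt "%a %i"
#   WANTED:  bare IDs, one per line (e.g. ABC|DEF60375.1)
# Prints the full identifier for each wanted ID; reports misses on stderr.
resolve_ids() {
  awk '
    !NF { next }                           # skip blank lines in either file
    NR == FNR { full[$1] = $2; next }      # first file: accession -> identifier map
    ($1 in full) { print full[$1]; next }  # second file: translate known IDs
    { print "not in listing: " $1 > "/dev/stderr" }
  ' "$1" "$2"
}

# Typical use (assumes BLAST+ is installed and the database exists):
#   blastdbcmd -db my_protein_db -entry all -outfmt "%a %i" > listing.txt
#   resolve_ids listing.txt wanted_ids.txt > full_ids.txt
#   blastdbcmd -db my_protein_db -entry_batch full_ids.txt -out extracted.fasta
```

Because the lookup goes through the listing, it does not matter which prefix a particular entry received. The listing file is assumed to be non-empty.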

written 15 months ago by genomax87k


I will check them out, thanks!

written 15 months ago by max_19150