Question: Blastdbcmd command is failing to give sequences.
Prasad wrote, 3.9 years ago:

Hello all,

I am trying to retrieve sequences from the NCBI nr database using the following command:

blastdbcmd -entry 'all' -db Path/to/db -outfmt '%f' -out output.fasta 

It starts writing the FASTA file, but the command fails after some time. The error message I get is:

Error: CSeqDBAtlas::MapMmap: While mapping file [/mnt/LV1/blast_db/nr_nt/nr.07.psq] with 12898483378 bytes allocated, caught exception:
NCBI C++ Exception:
    "/build/buildd/ncbi-blast+-2.2.28/c++/src/objtools/blast/seqdb_reader/seqdbatlas.cpp", line 152: Error: ncbi::SeqDB_ThrowException() - Validation failed: [end <= file_size] at /build/buildd/ncbi-blast+-2.2.28/c++/src/objtools/blast/seqdb_reader/seqdbatlas.cpp:506

Has anyone faced this problem? If so, how can I fix this error?

Any help is appreciated.

Thanks

 

pld wrote, 3.9 years ago:

It seems like the database you have might be corrupted; maybe one of the files is incomplete (it didn't download completely).

Is there any reason why you need every single sequence from NR in FASTA? That is a massive file. You can just download it manually from FTP:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/

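One quick way to test the "incomplete file" suspicion is to look for volume files that are missing or empty. The sketch below assumes a protein database whose volumes each consist of `.pin`, `.phr` and `.psq` files in one directory; `check_blast_volumes` is a hypothetical helper, not part of BLAST+. It will not catch a file that is truncated but non-empty, so if the original tarballs and their `.md5` files are still around, `md5sum -c` on those is the stronger check.

```shell
# Hypothetical helper: report protein BLAST volume files that are missing or empty.
# Assumes every volume has .pin, .phr and .psq files sitting in the same directory.
check_blast_volumes() {
    local db_dir=$1 status=0 pin base ext
    for pin in "$db_dir"/*.pin; do
        # If the glob matched nothing, $pin is the literal pattern and does not exist.
        [ -e "$pin" ] || { echo "no .pin files in $db_dir" >&2; return 2; }
        base=${pin%.pin}
        for ext in pin phr psq; do
            # -s is true only for files that exist and have a size greater than zero.
            if [ ! -s "$base.$ext" ]; then
                echo "missing or empty: $base.$ext"
                status=1
            fi
        done
    done
    return $status
}
```

For the database in the error message this would be run as `check_blast_volumes /mnt/LV1/blast_db/nr_nt`; a non-zero exit status means at least one volume file needs to be downloaded again.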

I am trying to compare two versions of the NR database. I want to get the GI numbers from both databases so that I can BLAST sequences only against the new GIs. For the latest nr database I could download the FASTA file and grep all the GIs from it. But for the older nr database we don't have a FASTA file, so first I tried to get just the GIs from the database using

 blastdbcmd -entry 'all' -db Path/to/db -outfmt '%g' -out gi_id_list.txt

That process is taking forever; by my calculations it will take around 27 days to produce gi_id_list.txt.

That is why I was trying to get every single sequence from NR in FASTA: so that I can grep the GIs out of it quickly.

Is there any quicker way to get all the GIs from an older NR database?

Reply written 3.9 years ago by Prasad
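Once both GI lists exist, the comparison step itself is cheap. A minimal sketch, assuming one GI per line in each file; `new_gis` and the file names in the usage line are made up for illustration. It sorts both lists to temporary files and uses `comm` to keep only lines unique to the newer list, so neither list has to fit in memory.

```shell
# Hypothetical helper: print the GIs that occur in the NEW list but not in the OLD list.
# Usage: new_gis OLD_LIST NEW_LIST   (one GI per line in each file)
new_gis() {
    old_sorted=$(mktemp) new_sorted=$(mktemp)
    # comm needs both inputs sorted in the same collation order; LC_ALL=C keeps it
    # consistent and fast. sort -u also drops duplicate GIs.
    LC_ALL=C sort -u "$1" > "$old_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    # -1 suppresses lines only in OLD, -3 suppresses lines in both: NEW-only lines remain.
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

For example, `new_gis old_gi_list.txt new_gi_list.txt > new_only_gis.txt`. The output is in byte order rather than numeric order. A file like this could then be given to a BLAST+ search through its `-gilist` option to restrict the search to the new entries.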

See "Is there any BLAST database archive?" for an alternative approach.

Reply written 3.9 years ago by h.mon

Sounds good to me. I will give this approach a try.

 

Reply written 3.9 years ago by Prasad

If you read the documentation for blastdbcmd, it will lay out some of the available output options. I know it is possible to only collect ids/accessions/etc.

Reply written 3.9 years ago by pld

Yes, blastdbcmd does give just the IDs. I did try

blastdbcmd -entry 'all' -db Path/to/db -outfmt '%g' -out gi_id_list.txt

which outputs only the IDs, but as I said in my earlier comment, it will take forever to complete.

I am looking for a faster way to get the GI list from the NR database. As of now, the quickest way I can see to get the GIs is from the FASTA file.

 

Reply written 3.9 years ago by Prasad

I think either way will take a really long time unless you can fit the whole file into memory. Getting the GIs from the FASTA file would require you to parse each FASTA definition line, which might slow you down.

Reply written 3.9 years ago by pld
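For what it is worth, the definition-line parsing can be done as a stream, so the FASTA file never has to fit in memory. A rough sketch, assuming the older nr defline style in which each entry starts with `gi|<number>|` and merged entries share one defline; `extract_gis` is a hypothetical name, and `grep -o` is a GNU/BSD extension rather than POSIX. No timing is claimed here.

```shell
# Hypothetical helper: stream a FASTA file and print every GI found on its deflines.
# Assumes deflines of the form ">gi|123|ref|NP_...| description", where merged
# entries on the same defline each carry their own "gi|<number>" token.
extract_gis() {
    # Keep only deflines, pull out each "gi|<digits>" token, then keep the digits.
    LC_ALL=C grep '^>' "$1" | LC_ALL=C grep -o 'gi|[0-9][0-9]*' | cut -d'|' -f2
}
```

For example, `extract_gis nr.fasta > gi_id_list.txt` (the input file name is illustrative). A defline whose description text happens to contain something shaped like `gi|123` would also be picked up, so the result is worth spot-checking.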