Question: Missing sequences in local NT database
Prakki Rama (Singapore) wrote, 8 days ago:

Hi all, I have downloaded the whole NT database locally for running BLAST. While searching, I noticed that some sequences are missing from the local NT database even though they can be found on the NCBI website. These are some of the accessions that could not be found in the local NT:

Error: [blastdbcmd] Entry not found: NC_019090.1
Error: [blastdbcmd] Entry not found: NC_019424.1
Error: [blastdbcmd] Entry not found: NZ_CP021711.1
Error: [blastdbcmd] Entry not found: NZ_CP021210.1
Error: [blastdbcmd] Entry not found: NC_020278.2
Error: [blastdbcmd] Entry not found: NC_019095.1
Error: [blastdbcmd] Entry not found: NZ_CP029734.1
Error: [blastdbcmd] Entry not found: NZ_CP016389.1
Error: [blastdbcmd] Entry not found: NZ_CP029974.1
Error: [blastdbcmd] Entry not found: NZ_CP015072.1
Error: [blastdbcmd] Entry not found: NZ_CP007652.1
Error: [blastdbcmd] Entry not found: NC_024954.1
Error: [blastdbcmd] Entry not found: NC_019163.1
Error: [blastdbcmd] Entry not found: NZ_CP024879.1
Error: [blastdbcmd] Entry not found: NC_015872.1
Error: [blastdbcmd] Entry not found: NZ_CP016037.1
Error: [blastdbcmd] Entry not found: NZ_CP010880.1
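(The accessions can be checked in one pass rather than one at a time; a minimal sketch, assuming they are saved to a hypothetical file missing_acc.txt:)

$ blastdbcmd -db nt -entry_batch missing_acc.txt -outfmt '%a'
# prints the accession of each entry that is found; each missing entry
# is reported as an "Entry not found" error on stderr, as above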

Suspecting that I had not downloaded all the files, I counted the downloaded volumes (nt.00..nt.60) and checked them against my .nal file, which looks like this:

$ cat nt.nal 
#
# Alias file created 08/08/2018 12:50:38
#
TITLE Nucleotide collection (nt)
DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29" "nt.30" "nt.31" "nt.32" "nt.33" "nt.34" "nt.35" "nt.36" "nt.37" "nt.38" "nt.39" "nt.40" "nt.41" "nt.42" "nt.43" "nt.44" "nt.45" "nt.46" "nt.47" "nt.48" "nt.49" "nt.50" "nt.51" "nt.52" "nt.53" "nt.54" "nt.55" "nt.56" "nt.57" "nt.58" "nt.59" "nt.60" 
NSEQ 49266009
LENGTH 188943333900

I also spot-checked the md5sums of the NT files, and they matched the md5sums available on the NCBI FTP site. Am I missing something here? Many thanks in advance for your comments.
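(For completeness, a sketch of how every volume, not just a sample, can be verified: the FTP site provides a .md5 file per volume, and update_blastdb.pl, which ships with BLAST+, verifies checksums as it downloads:)

$ for f in nt.*.tar.gz; do md5sum -c "${f}.md5"; done
$ update_blastdb.pl --decompress nt    # alternative: download and verify in one step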

Tags: blastdbcmd, blast, nt, ncbi
modified 8 days ago by lieven.sterck • written 8 days ago by Prakki Rama

nr/nt (the one on the NCBI website) is not the same database as the nt which you can download from their FTP.

written 8 days ago by 5heikki

I assumed they were the same.

written 8 days ago by Prakki Rama

Makes two of us.

written 8 days ago by lieven.sterck

And what's the difference then?

written 8 days ago by lieven.sterck

nr = All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects

and

nt = The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.

modified 8 days ago • written 8 days ago by genomax

That nr definition is for the protein DB. I don't know exactly what differs between nr/nt (the one on the website) and nt (the one on the FTP), but right now nr/nt has 48,336,722 seqs whereas OP's nt is slightly larger, with 49,266,009 seqs. I tried a few identifiers from the OP and they were all RefSeq sequences. Could it be that those seqs are in nt, but under GenBank identifiers rather than RefSeq identifiers, e.g. NZ_CP016037.1 as CP016037.1, NC_019095.1 as JF927996.1, etc.? (A quick check along these lines is sketched after the README excerpt below.)

Edit: as the README states:

  1. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are merged into one entry in these databases. To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC [Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
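(The check mentioned above, as a rough sketch: strip the RefSeq prefix and query the GenBank accession directly. This only works for NZ_ accessions, where the GenBank accession is embedded in the name; NC_ records like NC_019095.1 map to unrelated GenBank accessions and need a separate lookup. The file nz_acc.txt is hypothetical.)

$ blastdbcmd -db nt -entry CP016037.1 -outfmt '%a %t'                        # GenBank counterpart of NZ_CP016037.1
$ blastdbcmd -db nt -entry_batch <(sed 's/^NZ_//' nz_acc.txt) -outfmt '%a'   # whole list at once (bash)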

modified 6 days ago • written 6 days ago by 5heikki

OK, right, I never noticed that before, but indeed it says nr/nt for the non-redundant DB in blastn.

written 6 days ago by lieven.sterck

From what I can see, it's a different 'state' of non-redundancy:

From the FTP README:

nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS.

From the NCBI blastn page:

Title: Nucleotide collection (nt). Description: The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.

Though I totally agree this only adds to the confusion.

written 6 days ago by lieven.sterck

OK, yes, that I know.

I thought the claim was that the nr (or nt) available from the FTP is different from the one on the NCBI BLAST page itself.

written 8 days ago by lieven.sterck
lieven.sterck (Belgium, Ghent, VIB) wrote, 8 days ago:

I contacted NCBI about this issue recently, and it seems they are having problems formatting the nr/nt databases that they offer via FTP. The ones on their own website are indeed OK. They mentioned they are working on the issue. From their reply:

The large number of volumes of these public databases makes their maintenance a big challenge. The FASTA requires extra steps and space to archive, which adds significant strain to the limited resources we have on hand.

I see that the FASTA has a newer date than the preformatted databases, which indicates some issue in their update. Given that the FASTA has a newer time stamp, it is not a surprise for it to contain newer data that are not present in the preformatted versions.

I will check with our developers and ask them to look into the issue. Your patience and understanding will be greatly appreciated.

Regards,

NCBI User Services

The fasta files they provide on their FTP are OK as well, so you can download the fasta file and format the DB locally.
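(A minimal sketch, assuming the standard FTP layout; gunzip leaves the fasta in a file named nt:)

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
$ gunzip nt.gz
$ makeblastdb -in nt -dbtype nucl -parse_seqids -title "nt" -out nt    # takes a few hours for nt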

modified 8 days ago • written 8 days ago by lieven.sterck

Thanks for posting NCBI's reply, @lieven.sterck. Let me check the accessions in the fasta file then.

written 8 days ago by Prakki Rama

I also noticed it, but I think the first time I noticed it was already 6 months ago.

written 8 days ago by gb

I am trying to download the fasta file from the FTP site, but the connection times out. I also tried downloading with axel from the terminal and had the same problem. Any other ideas for accessing the fasta sequence data?

written 7 days ago by Prakki Rama

You might give one of the mirrors a try? We often use the following one: http://mirrors.bi.vt.edu/mirrors/ftp.ncbi.nih.gov/
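(A sketch of a resumable download, assuming the mirror keeps NCBI's directory layout; -c resumes a partial file and --tries=0 retries indefinitely:)

$ wget -c --tries=0 http://mirrors.bi.vt.edu/mirrors/ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz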

written 7 days ago by lieven.sterck

Sure, thanks much @lieven! I will try it and post again!

written 6 days ago by Prakki Rama

OK, from what I can see, it seems they updated the (preformatted) nr DBs yesterday/this morning (Oct 10th):

File: nr.00.tar.gz  316659 KB  10/9/2018 9:46:00 PM

The nt is still lagging behind :/ (though they updated the fasta file of nt yesterday as well):

File: nt.00.tar.gz  828406 KB  8/12/2018 2:40:00 PM

modified 6 days ago • written 6 days ago by lieven.sterck

But I still can't understand why it's so difficult to update the preformatted DBs along with the fasta files. If NCBI doesn't have one CPU and a few GB of RAM to spare to accomplish this, I start to get worried.

written 6 days ago by lieven.sterck

Making a huge BLAST DB takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the DB files, but what would be gained? I don't really understand why they even offer the fasta files...

written 6 days ago by 5heikki

Someone wrote that pipeline many moons ago. Since it used to work, no one wants to mess with it. Now the data sizes have become so large that even NCBI's massive compute infrastructure may not be keeping up with refreshing these myriad files each night.

written 6 days ago by genomax

Yeah, OK, overnight might indeed be pushing it, but something like once a week should be possible, no?

Moreover, it would also be OK if they announced that from now on they will no longer provide this (and offer, for instance, only the fasta files); at least then we would know what to expect.

Anything is better than offering differently versioned fasta and DB files!

modified 5 days ago • written 6 days ago by lieven.sterck

For the same reason, we generally download the pre-made DB every week. Unless you need bleeding-edge data there is no need to do this every night. Users can always use web BLAST if they are looking for the most current data.
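(A sketch of how such a weekly refresh could be scheduled with cron and the update_blastdb.pl script that ships with BLAST+; /data/blastdb is a placeholder path:)

# crontab entry: every Sunday at 02:00, refresh the local nt
0 2 * * 0  cd /data/blastdb && update_blastdb.pl --decompress nt >> update.log 2>&1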

modified 6 days ago • written 6 days ago by genomax

Same approach here. And indeed, once a week is more than frequent enough.

But an annoying issue we have in our lab is that for some resources we need to start from fasta files and for others we download the pre-made DBs.

Perhaps we should consider switching to just downloading the fasta files and building all the DBs from those.

written 5 days ago by lieven.sterck

Do you always need the fasta files? Can those be out of sync with the pre-formatted DB (as you have discovered)? So basically two separate downloads.

I wonder if recreating the fasta file from the pre-formatted DB is faster than creating the DB from the downloaded fasta.

written 5 days ago by genomax

Huh... that might (technically) be a brilliant idea: recreate the fasta from the DB rather than downloading it.

However, the pre-made DB will always lag behind the fasta file (as you need the fasta to create the DB), as is the case now, and not by a few days but by nearly 2 months for nt. So from a 'being up-to-date' point of view, we'll be better off with the fasta file.

modified 5 days ago • written 5 days ago by lieven.sterck

I actually did it this way, but noticed that if I execute the following command:

blastdbcmd -db nt -entry all > nt.fa

sometimes a sequence gets two headers in nt.fa, like this:

>accession|header1>accession|header2
AGTAGATAGAGAGACGACACTAGCATCA

Maybe the command is wrong... I do not remember which version I used back then.

modified 5 days ago • written 5 days ago by gb

@gb These are merged data entries in the NCBI databases: identical sequences share one record, and their deflines are joined into a single header.
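(A hedged workaround, keeping only the first defline of each merged record; this assumes the deflines are joined by the Ctrl-A (0x01) byte described in the README above, and \x01 is GNU sed syntax:)

$ blastdbcmd -db nt -entry all | sed '/^>/ s/\x01.*$//' > nt.fa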

modified 5 days ago • written 5 days ago by Prakki Rama

Thanks! Something learned again.

written 5 days ago by gb

Here is the link to the official explanation from NCBI: A: non redundant protein sequence database

written 5 days ago by genomax

Making a huge BLAST DB takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the DB files, but what would be gained? I don't really understand why they even offer the fasta files...

I formatted it here myself recently, so yes, it does indeed take a few hours (on a single core, no multi-threading possible), but I can't see how that would be an issue for a player such as NCBI.

modified 6 days ago • written 6 days ago by lieven.sterck

Did you create it with the -parse_seqids flag? I recall it taking way longer than a few hours. The compression part shouldn't take that long, since it can be parallelized.

written 5 days ago by 5heikki

Yes, with the -parse_seqids option on (I always do that) it took about 4 hours, on a single core (no idea you could even parallelize anything for this?).

written 5 days ago by lieven.sterck