Missing sequences in local NT database
2.5 years ago
Prakki Rama ★ 2.4k

Hi all, I have downloaded the whole NT database to run BLAST locally. During my searches, some sequences that are found on the NCBI website are missing from the local NT database. These are some of the accessions that could not be found in the local NT:

Error: [blastdbcmd] Entry not found: NC_019090.1


Unsure whether I had downloaded all the files, I checked the number of downloaded volumes (nt.00 to nt.60) against my .nal file, which looks like this:

$ cat nt.nal
#
# Alias file created 08/08/2018 12:50:38
#
TITLE Nucleotide collection (nt)
DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29" "nt.30" "nt.31" "nt.32" "nt.33" "nt.34" "nt.35" "nt.36" "nt.37" "nt.38" "nt.39" "nt.40" "nt.41" "nt.42" "nt.43" "nt.44" "nt.45" "nt.46" "nt.47" "nt.48" "nt.49" "nt.50" "nt.51" "nt.52" "nt.53" "nt.54" "nt.55" "nt.56" "nt.57" "nt.58" "nt.59" "nt.60"
NSEQ 49266009
LENGTH 188943333900


I also checked the md5sums of a random selection of the NT files, and they match the md5sums available on the NCBI FTP page. Am I missing something here? Many thanks in advance for your comments.
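The volume check described above can be scripted rather than done by eye. This is only a minimal sketch: `missing_volumes` is an illustrative helper name (not part of BLAST+), and it assumes that every nucleotide volume listed on the DBLIST line should have a `.nsq` file in the same directory as the alias file.

```shell
# List volumes named on the DBLIST line of a BLAST alias file (.nal)
# that have no sequence file next to it.
# missing_volumes is an illustrative helper name, not part of BLAST+.
missing_volumes() (
    nal="$1"
    dir="$(dirname "$nal")"
    # Strip the keyword and the quotes, then put one volume name per line.
    sed -n 's/^DBLIST[[:space:]]*//p' "$nal" | tr -d '"' | tr -s ' ' '\n' |
    while read -r vol; do
        [ -n "$vol" ] || continue
        # Assumption: a nucleotide volume always has a .nsq sequence file,
        # so its absence is taken as the sign of a missing volume.
        [ -e "$dir/$vol.nsq" ] || echo "$vol"
    done
)
```

Running `missing_volumes /path/to/nt.nal` prints nothing when every listed volume is present. For the archives themselves, `md5sum -c nt.00.tar.gz.md5` (GNU coreutils) should check a downloaded tarball against the checksum file NCBI publishes next to it, assuming both files sit in the current directory.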


nr/nt (the one on the NCBI website) is not the same database as the nt you can download from their ftp.


I assumed they were the same.


makes two of us


and what's the difference then?


nr = All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects

and

nt = The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.


That nr definition is for the protein db. I don't know what exactly is different between nr/nt (the one on the website) and nt (the one on the ftp), but right now nr/nt has 48,336,722 seqs whereas OP's nt is slightly larger with 49,266,009 seqs. I tried a few identifiers from OP and they were all RefSeq sequences. Could it be that those seqs are in nt, not under their RefSeq identifiers but under GenBank identifiers, e.g. CP016037.1 instead of NZ_CP016037.1, JF927996.1 instead of NC_019095.1, etc.?
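One way to test this guess is to query the local database with both forms of an accession. The sketch below only covers the NZ_ case, where the GenBank accession is the RefSeq one without its prefix; NC_ records such as NC_019095.1 have no such textual relation to their GenBank counterpart and would have to be looked up. `candidate_accessions` is a hypothetical helper name.

```shell
# Print an accession and, for NZ_-prefixed RefSeq accessions, the
# GenBank accession obtained by dropping the prefix.
# candidate_accessions is a hypothetical helper name.
candidate_accessions() (
    acc="$1"
    echo "$acc"
    case "$acc" in
        # Only NZ_ accessions can be converted this way.
        NZ_*) echo "${acc#NZ_}" ;;
    esac
)
```

Each printed identifier can then be tried in turn, e.g. `for id in $(candidate_accessions NZ_CP016037.1); do blastdbcmd -db nt -entry "$id"; done`.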

1. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are merged into one entry in these databases. To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC [Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
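Because the control-A separators are invisible in most viewers, it can help to split a merged defline into one line per member entry. A minimal sketch, assuming the FASTA still contains real control-A bytes (octal 001) rather than the literal two characters `^A`; `split_defline` is an illustrative name.

```shell
# Split merged nr/nt deflines into one line per member entry.
# Reads FASTA on stdin and ignores the sequence lines.
# split_defline is an illustrative helper name.
split_defline() {
    grep '^>' |                 # keep only the defline(s)
    sed 's/^>//' |              # drop the leading '>'
    tr '\001' '\n' |            # control-A (octal 001) becomes a newline
    sed 's/[[:space:]]*$//'     # trim trailing blanks left by the split
}
```

Piping a record through `split_defline` should list every accession that shares the sequence, which makes it easy to grep for a specific identifier.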


OK, right, never noticed that before but indeed it says nr/nt for the non-redundant DB in blastn


from what I can see it's a different 'state' of non-redundancy :

nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS.

from the NCBI blastn page:

Title: Nucleotide collection (nt)
Description: The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.

Though I totally agree this only adds to the confusion.


ok, yes, that I know.

I thought the statement was that the nr (or nt) available from the ftp is different than the one from the ncbi blast page itself.


https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=sequences&id=NC_019090.1&rettype=fasta&retmode=text



You are in Singapore so bandwidth should not be an issue unless you are behind some restrictive scanning firewall.

2.5 years ago

I contacted NCBI about this issue recently and it seems they're having problems formatting the nr/nt databases that they offer via ftp. The ones on their own website are indeed OK. They mentioned that they are working on this issue. Their reply:

The large number of volumes of these public databases makes their maintenance a big challenge. The FASTA requires extra steps and spaces to archive, which adds significant strains to the limited resources we have on hand.

I see that the FASTA has a newer date than the preformatted, which indicates some issue in their update. Given that the FASTA has a newer time stamp, it is not a surprise for it to have newer data that are not present in the preformatted versions.

I will check with our developers and ask them to look into the issue. Your patience and understanding will be greatly appreciated.

Regards,

NCBI User Services

The fasta files they provide on their ftp are OK as well, so you can download the fasta file and format the DB locally
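Formatting the database locally could look roughly like the sketch below. It assumes the FASTA has already been downloaded and decompressed to a file (the name `nt.fa` used in the note after the block is a placeholder) and that BLAST+ is on the PATH. `build_local_nt` and the `DRY_RUN` switch are hypothetical conveniences; by default the function only prints the command so it can be reviewed before committing to a run that takes hours.

```shell
# Compose (and optionally run) the makeblastdb call for a local nt FASTA.
# build_local_nt and DRY_RUN are hypothetical; makeblastdb and its flags
# come from BLAST+.
build_local_nt() (
    fasta="$1"
    name="${2:-nt}"
    # -parse_seqids keeps accessions retrievable with blastdbcmd -entry.
    set -- makeblastdb -in "$fasta" -dbtype nucl -parse_seqids -out "$name"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # default: only show the command
    else
        "$@"           # DRY_RUN=0: actually build the database
    fi
)
```

With `DRY_RUN=0 build_local_nt nt.fa` the database is actually built; afterwards `blastdbcmd -db nt -entry NC_019090.1` is a quick way to see whether the previously missing accession is now retrievable.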


Thanks for posting NCBI's reply @lieven.sterck. Let me check the accession in the fasta file then.


I also noticed it but I think that the first time I noticed it was already 6 months ago


I am trying to download the fasta file from the FTP, but the connection times out. I also tried downloading with axel from the terminal, with the same result. Any other ideas for accessing the fasta sequence data?


you might give one of the mirrors a try? we often use the following one: http://mirrors.bi.vt.edu/mirrors/ftp.ncbi.nih.gov/


Sure Thanks much @lieven !! I will try it and post again!


OK, from what I can see, it seems they updated the (preformatted) nr DBs yesterday ( / this morning) (Oct 10th)

File:nr.00.tar.gz 316659 KB 10/9/2018 9:46:00 PM

The nt is still lagging behind :/ (though they updated the fasta file of nt yesterday as well)

File:nt.00.tar.gz 828406 KB 8/12/2018 2:40:00 PM


But I still can't understand why it's that difficult to update the preformatted DBs along with the fasta files. If they don't have 1 CPU and a few GB of RAM to spare at NCBI to accomplish this, I start to get worried.


Making a huge blast db takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the db files, but what would be gained? I don't really understand why they even offer the fasta files..


Someone wrote that pipeline many moons ago. Since it used to work no one wants to mess with it. Now the data sizes have become so large that even NCBI's massive compute infrastructure may not be keeping up with refreshing these myriad files each night.


yeah, OK , overnight might indeed be pushing it, but something like once a week should be possible, no?

Moreover, it would also be OK if they told us that from now on they will no longer provide this (and, for instance, only offer the fasta files); at least then we would know what to expect.

Anything is better than offering differently versioned fasta and DB files!


For the same reason, we generally download the pre-made DB every week. Unless you need bleeding-edge data there is no need to do this every night. Users can always use web blast if they are looking for the most current data.


Same approach here. And indeed once a week is more than frequent enough.

But an annoying issue we have in our lab is that for some resources we need to start from fasta files and for others we download the pre-made DBs.

Perhaps we should consider switching to just downloading the fasta files and build all the DBs from that.


Do you always need the fasta files? Can those be out of sync from the pre-formatted db (as you have discovered)? So basically two separate downloads.

Wonder if recreating the fasta file from the pre-formatted DB is faster than creating the DB from the downloaded fasta.


huh, ... that might be a brilliant idea (technically) to recreate from the DB rather than downloading it.

However, the pre-made DB will always be lagging behind the fasta file (as you need the fasta to create the DB), as is the case now, and not by a few days but by nearly 2 months for nt. So from a 'being up-to-date' point of view we'll be better off with the fasta file.


I actually did it this way but noticed that if I execute the following command:

blastdbcmd -db nt -entry all > nt.fa


Sometimes a sequence got two headers in nt.fa like this:

>accession|header1>accession|header2
AGTAGATAGAGAGACGACACTAGCATCA


Maybe the command is wrong... I do not remember the version I used back then


@gb These are merged data entries in NCBI databases.
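If a downstream tool needs exactly one header per sequence, a FASTA dump with merged headers can be reduced to the first entry of each record. This is a rough sketch, not a BLAST+ feature: `first_header_only` is an illustrative name, and it assumes that a second `>` on a header line always starts another merged entry, which would be wrong for a title that legitimately contains `>`.

```shell
# Keep only the first entry of merged FASTA headers; sequence lines pass
# through unchanged. Reads stdin, writes stdout.
# first_header_only is an illustrative helper name.
first_header_only() {
    awk '
        /^>/ {
            h = substr($0, 2)           # header without the leading ">"
            i = index(h, ">")           # start of a second merged entry, if any
            if (i > 0) h = substr(h, 1, i - 1)
            sub(/[ \t]+$/, "", h)       # trim blanks left before the cut
            print ">" h
            next
        }
        { print }
    '
}
```

For example, `first_header_only < nt.fa > nt.single_header.fa` would write a copy with one header per record (the output file name is a placeholder); the other accessions of a merged entry are dropped from that copy.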


Thanks! something learned again


Here is the link for the official explanation from NCBI: A: non redundant protein sequence database


"Making a huge blast db takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the db files, but what would be gained? I don't really understand why they even offer the fasta files.."

I formatted it here myself recently, so yes, it indeed takes a few hours (on a single core, no multithreading possible), but I can't see how that would be an issue for a player such as NCBI.


Did you create it with the parse_seqids flag? I recall it taking way longer than a few hours. The compression part shouldn't take that long since it can be parallelized


yes, with the parse_seqids option on (always do that) it took about 4 hours, on a single core (no idea you even could parallelize anything for this?)