Question: Missing sequences in local NT database
Prakki Rama (Singapore) wrote, 8 days ago:

Hi all, I have downloaded the whole NT database locally for running BLAST. While searching, I noticed that some sequences are missing from the local NT database even though they can be found on the NCBI website. These are some of the accessions that could not be found in the local NT:

Error: [blastdbcmd] Entry not found: NC_019090.1
Error: [blastdbcmd] Entry not found: NC_019424.1
Error: [blastdbcmd] Entry not found: NZ_CP021711.1
Error: [blastdbcmd] Entry not found: NZ_CP021210.1
Error: [blastdbcmd] Entry not found: NC_020278.2
Error: [blastdbcmd] Entry not found: NC_019095.1
Error: [blastdbcmd] Entry not found: NZ_CP029734.1
Error: [blastdbcmd] Entry not found: NZ_CP016389.1
Error: [blastdbcmd] Entry not found: NZ_CP029974.1
Error: [blastdbcmd] Entry not found: NZ_CP015072.1
Error: [blastdbcmd] Entry not found: NZ_CP007652.1
Error: [blastdbcmd] Entry not found: NC_024954.1
Error: [blastdbcmd] Entry not found: NC_019163.1
Error: [blastdbcmd] Entry not found: NZ_CP024879.1
Error: [blastdbcmd] Entry not found: NC_015872.1
Error: [blastdbcmd] Entry not found: NZ_CP016037.1
Error: [blastdbcmd] Entry not found: NZ_CP010880.1
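(The accessions can be checked in one pass rather than one at a time; a minimal sketch, assuming they are saved to a hypothetical file missing_acc.txt:)

$ blastdbcmd -db nt -entry_batch missing_acc.txt -outfmt '%a'
# prints the accession of each entry that is found; each missing entry
# is reported as an "Entry not found" error on stderr, as above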

Suspecting that I had not downloaded all the files, I counted the downloaded volumes (nt.00..nt.60) and checked them against my .nal file, which looks like this:

$ cat nt.nal 
#
# Alias file created 08/08/2018 12:50:38
#
TITLE Nucleotide collection (nt)
DBLIST "nt.00" "nt.01" "nt.02" "nt.03" "nt.04" "nt.05" "nt.06" "nt.07" "nt.08" "nt.09" "nt.10" "nt.11" "nt.12" "nt.13" "nt.14" "nt.15" "nt.16" "nt.17" "nt.18" "nt.19" "nt.20" "nt.21" "nt.22" "nt.23" "nt.24" "nt.25" "nt.26" "nt.27" "nt.28" "nt.29" "nt.30" "nt.31" "nt.32" "nt.33" "nt.34" "nt.35" "nt.36" "nt.37" "nt.38" "nt.39" "nt.40" "nt.41" "nt.42" "nt.43" "nt.44" "nt.45" "nt.46" "nt.47" "nt.48" "nt.49" "nt.50" "nt.51" "nt.52" "nt.53" "nt.54" "nt.55" "nt.56" "nt.57" "nt.58" "nt.59" "nt.60" 
NSEQ 49266009
LENGTH 188943333900

I also spot-checked the md5sums of the NT files, and they matched the md5sums available on the NCBI FTP site. Am I missing something here? Many thanks in advance for your comments.
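(For completeness, a sketch of how every volume, not just a sample, can be verified: the FTP site provides a .md5 file per volume, and update_blastdb.pl, which ships with BLAST+, verifies checksums as it downloads:)

$ for f in nt.*.tar.gz; do md5sum -c "${f}.md5"; done
$ update_blastdb.pl --decompress nt    # alternative: download and verify in one step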

Tags: blastdbcmd, blast, nt, ncbi
modified 8 days ago by lieven.sterck • written 8 days ago by Prakki Rama

nr/nt (the one on the NCBI website) is not the same database as the nt which you can download from their FTP.

written 8 days ago by 5heikki

I assumed they were the same.

written 8 days ago by Prakki Rama

Makes two of us.

written 8 days ago by lieven.sterck

And what's the difference then?

written 8 days ago by lieven.sterck

nr = All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects

and

nt = The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.

modified 8 days ago • written 8 days ago by genomax

That nr definition is for the protein DB. I don't know exactly what differs between nr/nt (the one on the website) and nt (the one on the FTP), but right now nr/nt has 48,336,722 seqs whereas OP's nt is slightly larger, with 49,266,009 seqs. I tried a few identifiers from the OP and they were all RefSeq sequences. Could it be that those seqs are in nt, but under GenBank identifiers rather than RefSeq identifiers, e.g. NZ_CP016037.1 as CP016037.1, NC_019095.1 as JF927996.1, etc.? (A quick check along these lines is sketched after the README excerpt below.)

Edit: as the README states:

  1. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are merged into one entry in these databases. To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC [Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
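(The check mentioned above, as a rough sketch: strip the RefSeq prefix and query the GenBank accession directly. This only works for NZ_ accessions, where the GenBank accession is embedded in the name; NC_ records like NC_019095.1 map to unrelated GenBank accessions and need a separate lookup. The file nz_acc.txt is hypothetical.)

$ blastdbcmd -db nt -entry CP016037.1 -outfmt '%a %t'                        # GenBank counterpart of NZ_CP016037.1
$ blastdbcmd -db nt -entry_batch <(sed 's/^NZ_//' nz_acc.txt) -outfmt '%a'   # whole list at once (bash)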

modified 6 days ago • written 6 days ago by 5heikki

OK, right, I never noticed that before, but indeed it says nr/nt for the non-redundant DB in blastn.

written 6 days ago by lieven.sterck

From what I can see, it's a different 'state' of non-redundancy:

From the FTP README:

nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS.

From the NCBI blastn page:

Title: Nucleotide collection (nt). Description: The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is non-redundant. Identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry.

Though I totally agree this only adds to the confusion.

written 6 days ago by lieven.sterck

OK, yes, that I know.

I thought the claim was that the nr (or nt) available from the FTP is different from the one on the NCBI BLAST page itself.

written 8 days ago by lieven.sterck
lieven.sterck (Belgium, Ghent, VIB) wrote, 8 days ago:

I contacted NCBI about this issue recently, and it seems they are having problems formatting the nr/nt databases that they offer via FTP. The ones on their own website are indeed OK. They mentioned they are working on the issue. From their reply:

The large number of volumes of these public databases makes their maintenance a big challenge. The FASTA requires extra steps and space to archive, which adds significant strain to the limited resources we have on hand.

I see that the FASTA has a newer date than the preformatted databases, which indicates some issue in their update. Given that the FASTA has a newer time stamp, it is not a surprise for it to contain newer data that are not present in the preformatted versions.

I will check with our developers and ask them to look into the issue. Your patience and understanding will be greatly appreciated.

Regards,

NCBI User Services

The fasta files they provide on their FTP are OK as well, so you can download the fasta file and format the DB locally.
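(A minimal sketch, assuming the standard FTP layout; gunzip leaves the fasta in a file named nt:)

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
$ gunzip nt.gz
$ makeblastdb -in nt -dbtype nucl -parse_seqids -title "nt" -out nt    # takes a few hours for nt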

modified 8 days ago • written 8 days ago by lieven.sterck

Thanks for posting NCBI's reply, @lieven.sterck. Let me check the accessions in the fasta file then.

written 8 days ago by Prakki Rama

I also noticed it, but I think the first time I noticed it was already 6 months ago.

written 8 days ago by gb

I am trying to download the fasta file from the FTP site, but the connection times out. I also tried downloading with axel from the terminal and had the same problem. Any other ideas for accessing the fasta sequence data?

written 7 days ago by Prakki Rama

You might give one of the mirrors a try? We often use the following one: http://mirrors.bi.vt.edu/mirrors/ftp.ncbi.nih.gov/
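(A sketch of a resumable download, assuming the mirror keeps NCBI's directory layout; -c resumes a partial file and --tries=0 retries indefinitely:)

$ wget -c --tries=0 http://mirrors.bi.vt.edu/mirrors/ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz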

written 7 days ago by lieven.sterck

Sure, thanks much @lieven! I will try it and post again!

written 6 days ago by Prakki Rama

OK, from what I can see, it seems they updated the (preformatted) nr DBs yesterday/this morning (Oct 10th):

File: nr.00.tar.gz  316659 KB  10/9/2018 9:46:00 PM

The nt is still lagging behind :/ (though they updated the fasta file of nt yesterday as well):

File: nt.00.tar.gz  828406 KB  8/12/2018 2:40:00 PM

modified 6 days ago • written 6 days ago by lieven.sterck

But I still can't understand why it's so difficult to update the preformatted DBs along with the fasta files. If NCBI doesn't have one CPU and a few GB of RAM to spare to accomplish this, I start to get worried.

written 6 days ago by lieven.sterck

Making a huge BLAST DB takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the DB files, but what would be gained? I don't really understand why they even offer the fasta files...

written 6 days ago by 5heikki

Someone wrote that pipeline many moons ago. Since it used to work, no one wants to mess with it. Now the data sizes have become so large that even NCBI's massive compute infrastructure may not be keeping up with refreshing these myriad files each night.

written 6 days ago by genomax

Yeah, OK, overnight might indeed be pushing it, but something like once a week should be possible, no?

Moreover, it would also be OK if they announced that from now on they will no longer provide this (and offer, for instance, only the fasta files); at least then we would know what to expect.

Anything is better than offering differently versioned fasta and DB files!

modified 5 days ago • written 6 days ago by lieven.sterck

For the same reason, we generally download the pre-made DB every week. Unless you need bleeding-edge data there is no need to do this every night. Users can always use web BLAST if they are looking for the most current data.
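(A sketch of how such a weekly refresh could be scheduled with cron and the update_blastdb.pl script that ships with BLAST+; /data/blastdb is a placeholder path:)

# crontab entry: every Sunday at 02:00, refresh the local nt
0 2 * * 0  cd /data/blastdb && update_blastdb.pl --decompress nt >> update.log 2>&1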

modified 6 days ago • written 6 days ago by genomax

Same approach here. And indeed, once a week is more than frequent enough.

But an annoying issue we have in our lab is that for some resources we need to start from fasta files and for others we download the pre-made DBs.

Perhaps we should consider switching to just downloading the fasta files and building all the DBs from those.

written 5 days ago by lieven.sterck

Do you always need the fasta files? Can those be out of sync with the pre-formatted DB (as you have discovered)? So basically two separate downloads.

I wonder if recreating the fasta file from the pre-formatted DB is faster than creating the DB from the downloaded fasta.

written 5 days ago by genomax

Huh... that might (technically) be a brilliant idea: recreate the fasta from the DB rather than downloading it.

However, the pre-made DB will always lag behind the fasta file (as you need the fasta to create the DB), as is the case now, and not by a few days but by nearly 2 months for nt. So from a 'being up-to-date' point of view, we'll be better off with the fasta file.

modified 5 days ago • written 5 days ago by lieven.sterck

I actually did it this way, but noticed that if I execute the following command:

blastdbcmd -db nt -entry all > nt.fa

sometimes a sequence gets two headers in nt.fa, like this:

>accession|header1>accession|header2
AGTAGATAGAGAGACGACACTAGCATCA

Maybe the command is wrong... I do not remember which version I used back then.

modified 5 days ago • written 5 days ago by gb

@gb These are merged data entries in the NCBI databases: identical sequences share one record, and their deflines are joined into a single header.
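(A hedged workaround, keeping only the first defline of each merged record; this assumes the deflines are joined by the Ctrl-A (0x01) byte described in the README above, and \x01 is GNU sed syntax:)

$ blastdbcmd -db nt -entry all | sed '/^>/ s/\x01.*$//' > nt.fa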

modified 5 days ago • written 5 days ago by Prakki Rama

Thanks! Something learned again.

written 5 days ago by gb

Here is the link to the official explanation from NCBI: A: non redundant protein sequence database

written 5 days ago by genomax

Making a huge BLAST DB takes a lot of time. Sure, they could hold the release of the fasta files until they have formatted and compressed the DB files, but what would be gained? I don't really understand why they even offer the fasta files...

I formatted it here myself recently, so yes, it does indeed take a few hours (on a single core, no multi-threading possible), but I can't see how that would be an issue for a player such as NCBI.

modified 6 days ago • written 6 days ago by lieven.sterck

Did you create it with the -parse_seqids flag? I recall it taking way longer than a few hours. The compression part shouldn't take that long, since it can be parallelized.

written 5 days ago by 5heikki

Yes, with the -parse_seqids option on (I always do that) it took about 4 hours, on a single core (no idea you could even parallelize anything for this?).

written 5 days ago by lieven.sterck