Why is the preformatted refseq database larger than the nt database in blastdb?
8.0 years ago
shl198 ▴ 420

I went to the BLAST FTP site. There are 18 nt files, each smaller than 800 MB, while refseq_genomic has 83 files, most of them larger than 800 MB, which means refseq_genomic is much larger than the nt database. However, the definition of nt at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml says the nt database includes "All GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB sequences (excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS). No longer 'non-redundant'."

My questions are:

1. As I understand it, RefSeq Nucleotides should include refseq_genomic and refseq_rna, so refseq_genomic should be much smaller than the nt database. Why is refseq_genomic alone much larger than the whole nt database?
2. I took one accession number, NZ_AARG01000001.1, from the RefSeq bacterial genomes and ran blastn against both the nt and refseq_genomic databases. Against nt, the search took a few seconds and returned fewer than 10 hits. Against refseq_genomic, it took more than 10 minutes and returned more than 100 hits (all the accession numbers began with NZ). I then looked up NZ and found that it denotes projects that are not complete. So is the difference between nt and refseq_genomic that nt doesn't include NZ records?
blast refseq nt • 6.5k views
8.0 years ago
hpmcwill ★ 1.2k

On "RefSeq accession numbers and molecule types" you will see that the RefSeq accessions with the prefix NZ_ are from whole genome shotgun (WGS) projects. As such, these are excluded from 'nt'. Many of the other 'genomic' sections of RefSeq also come from WGS projects and are thus excluded as well.
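To see how many hits in a result list are WGS-style RefSeq records, the accessions can be sorted by pattern. This is only a rough sketch: the regular expression and the function names `looks_like_wgs_refseq` and `split_by_wgs` are my own assumptions based on the usual WGS accession layout (a project prefix of letters, a two-digit assembly version, then contig digits), not an NCBI tool.

```python
import re

# Assumption: WGS-style RefSeq accessions are "NZ_" followed by a 4-letter
# (or 6-letter) project prefix, a 2-digit assembly version and contig digits,
# e.g. NZ_AARG01000001.1. The pattern is illustrative, not an NCBI API.
WGS_REFSEQ_PATTERN = re.compile(
    r"^NZ_(?:[A-Z]{4}\d{8,}|[A-Z]{6}\d{9,})(?:\.\d+)?$"
)


def looks_like_wgs_refseq(accession):
    """Return True if the accession matches the assumed WGS RefSeq layout."""
    return WGS_REFSEQ_PATTERN.match(accession.strip()) is not None


def split_by_wgs(accessions):
    """Split accessions into (WGS-style, everything else), keeping order."""
    wgs, other = [], []
    for acc in accessions:
        (wgs if looks_like_wgs_refseq(acc) else other).append(acc)
    return wgs, other
```

With the accession from the question, `looks_like_wgs_refseq("NZ_AARG01000001.1")` is true, while a chromosome record such as `NC_000913.3` falls into the "everything else" list.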

On the NCBI BLAST pages, 'nt' is currently defined as:

Title: Nucleotide collection (nt)
Description: The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is partially non-redundant. In some cases identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry. Merged sequences include GenBank and RefSeq entries with identical sequences. Sequences added to the database since April, 2011 have also been merged with identical existing entries.
Molecule Type: mixed DNA
Update date: 2014/07/24
Number of sequences: 23840180

In contrast, 'refseq_genomic' is defined as:

Title: NCBI Genomic Reference Sequences
Molecule Type: mixed DNA
Update date: 2014/07/23
Number of sequences: 6733817

Note the difference in the number of sequences: 'nt' has far more entries. However, 'refseq_genomic' is much larger when you look at the number of bases: 435,293,002,525 vs. 62,649,172,490 for 'nt'. This is because 'refseq_genomic' includes assembled contigs and whole-chromosome assemblies, which are excluded from 'nt'.
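The size comparison can be worked through with the figures quoted above. This is a small sketch; the dictionary and function names are only for illustration, and the numbers are the 2014 snapshots from this answer, not current values.

```python
# Figures quoted in this answer (2014 snapshots of the two databases).
NT = {"sequences": 23_840_180, "bases": 62_649_172_490}
REFSEQ_GENOMIC = {"sequences": 6_733_817, "bases": 435_293_002_525}


def mean_length(db):
    """Average number of bases per sequence in a database summary."""
    return db["bases"] / db["sequences"]


def base_ratio(larger, smaller):
    """How many times more bases `larger` holds than `smaller`."""
    return larger["bases"] / smaller["bases"]


print(f"base ratio (refseq_genomic / nt): {base_ratio(REFSEQ_GENOMIC, NT):.2f}")
print(f"mean length, nt: {mean_length(NT):.0f} bases")
print(f"mean length, refseq_genomic: {mean_length(REFSEQ_GENOMIC):.0f} bases")
```

So 'refseq_genomic' holds roughly seven times as many bases in fewer than a third as many sequences, which fits a database made up of long contigs and chromosomes.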


Hi, how did you get the following information?

Molecule Type: mixed DNA
Update date: 2014/07/24
Number of sequences: 23840180

And how did you get the number of bases? Thanks.


The summary information for the databases comes from NCBI's BLAST service: the database help (the '?' icon next to the database selection) shows the details of each database. The number of bases comes from the summary information included in the BLAST search results for each database. Its location varies with the output format; on NCBI's BLAST service it is in the "Search Summary" section of the default HTML result.
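For a locally downloaded copy of a database, a similar summary can be printed with the `blastdbcmd` tool from BLAST+ and its `-info` option. The wrapper below is a minimal sketch, assuming BLAST+ is installed and the database can be found (for example via `BLASTDB`); the function names `blastdb_info_command` and `show_blastdb_info` are hypothetical helpers, not part of BLAST+.

```python
import shutil
import subprocess


def blastdb_info_command(db_name):
    """Build the blastdbcmd call that prints a database summary."""
    return ["blastdbcmd", "-db", db_name, "-info"]


def show_blastdb_info(db_name):
    """Return the summary text for a local BLAST database.

    Returns None if blastdbcmd is not on PATH or the call fails,
    for example because the database is not installed locally.
    """
    if shutil.which("blastdbcmd") is None:
        return None
    result = subprocess.run(
        blastdb_info_command(db_name),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None
```

Calling `show_blastdb_info("nt")` and `show_blastdb_info("refseq_genomic")` on a machine with both databases installed should give the sequence and base counts for the local versions, which may differ from the figures shown on the web service.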


Thank you. This helps me a lot.

I have a follow-up question. What would a Venn diagram of the relationship among the nt database, the RefSeq genome sequences and the RefSeq representative sequences look like? Thanks.