Question: Why is the pre-formatted RefSeq database larger than the nt database in blastdb?
5.2 years ago by
shl198350
United States
shl198350 wrote:

I went to the BLAST FTP database directory, ftp://ftp.ncbi.nlm.nih.gov/blast/db/. There are 18 nt files, each smaller than 800 MB, whereas refseq_genome has 83 files, most of which are larger than 800 MB. That means refseq_genome is much larger than the nt database. However, when I looked up the definition of nt on http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml, it says the nt database includes: All GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB sequences (excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS). No longer "non-redundant".

My questions are:

1. As I understand it, RefSeq Nucleotides should include refseq_genome and refseq_rna, so refseq_genome should be much smaller than the nt database. Why is refseq_genome alone much larger than the whole nt database?

2. I took one accession number, NZ_AARG01000001.1, from a RefSeq bacterial genome and ran blastn against both the nt and refseq_genome databases. Against nt, the search took a few seconds and returned fewer than 10 hits. Against refseq_genome, it took more than 10 minutes and returned more than 100 hits (all of the accession numbers began with NZ). I then looked up NZ and found that it denotes projects that are not yet complete. So is the difference between nt and refseq_genome that nt doesn't include NZ records?

blast nt refseq • 4.6k views
5.2 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

On the NCBI page "RefSeq accession numbers and molecule types" you will see that the RefSeq accessions with the prefix 'NZ_' are from whole genome shotgun (WGS) projects. As such, these are excluded from 'nt'. Many of the records in the other 'genomic' sections of RefSeq are also from WGS projects and are thus excluded as well.
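As a rough illustration of the prefix check, the sketch below flags accessions that carry the 'NZ_' prefix and have the same shape as the one in the question. The helper name `looks_like_refseq_wgs` and the pattern itself (four letters, a two-digit version, then six or more digits) are my own assumptions for illustration, not an NCBI API, so check them against the RefSeq accession table before relying on them.

```python
import re

# Assumed shape of a WGS-style RefSeq accession such as NZ_AARG01000001.1:
# "NZ_" + 4 letters + 2 digits + 6 or more digits + optional ".version".
# This pattern is an illustration only, not an official NCBI definition.
_NZ_WGS = re.compile(r"^NZ_[A-Z]{4}\d{2}\d{6,}(\.\d+)?$")


def looks_like_refseq_wgs(accession: str) -> bool:
    """Return True if the accession matches the assumed NZ_ WGS pattern."""
    return bool(_NZ_WGS.match(accession.strip()))


print(looks_like_refseq_wgs("NZ_AARG01000001.1"))  # accession from the question
print(looks_like_refseq_wgs("NC_000913.3"))        # an NC_ record for comparison
```

A check like this could be used to split a hit list into records you would expect to find in 'nt' and records you would not.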

From the NCBI BLAST pages 'nt' is currently defined as:

Title: Nucleotide collection (nt)
Description: The nucleotide collection consists of GenBank+EMBL+DDBJ+PDB+RefSeq sequences, but excludes EST, STS, GSS, WGS, TSA, patent sequences as well as phase 0, 1, and 2 HTGS sequences. The database is partially non-redundant. In some cases identical sequences have been merged into one entry, while preserving the accession, GI, title and taxonomy information for each entry. Merged sequences include GenBank and RefSeq entries with identical sequences. Sequences added to the database since April, 2011 have also been merged with identical existing entries.
Molecule Type: mixed DNA
Update date: 2014/07/24
Number of sequences: 23840180

In contrast 'refseq_genomic' is defined as:

Title: NCBI Genomic Reference Sequences
Molecule Type: mixed DNA
Update date: 2014/07/23
Number of sequences: 6733817

Note the difference in the number of sequences: 'nt' contains more entries. However, 'refseq_genomic' is much larger when you look at the number of bases: 435,293,002,525 for 'refseq_genomic' vs. 62,649,172,490 for 'nt'. This is because 'refseq_genomic' includes assembled contigs and whole chromosome assemblies, which are excluded from 'nt'.
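To put those numbers side by side, here is a quick back-of-the-envelope calculation that uses only the figures quoted in this answer (the variable names are mine). It gives roughly seven times as many bases in 'refseq_genomic', and a mean sequence length in the tens of kilobases versus a few kilobases for 'nt', which fits a database dominated by contigs and chromosome-scale records.

```python
# Figures quoted in this answer (2014 snapshots of the two databases).
nt = {"sequences": 23_840_180, "bases": 62_649_172_490}
refseq_genomic = {"sequences": 6_733_817, "bases": 435_293_002_525}

# How many times more bases refseq_genomic holds than nt.
base_ratio = refseq_genomic["bases"] / nt["bases"]

# Mean sequence length (bases per entry) in each database.
mean_len_nt = nt["bases"] / nt["sequences"]
mean_len_refseq = refseq_genomic["bases"] / refseq_genomic["sequences"]

print(f"refseq_genomic / nt (bases): {base_ratio:.2f}x")
print(f"mean length in nt:             {mean_len_nt:,.0f} bases")
print(f"mean length in refseq_genomic: {mean_len_refseq:,.0f} bases")
```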

 


Hi, I was wondering how you got this information:

Molecule Type: mixed DNA
Update date: 2014/07/24
Number of sequences: 23840180

And also the number of bases? Thanks.

written 5.2 years ago by shl198350

The summary information for the databases is from the NCBI's BLAST service: the database help ('?' icon next to the database selection) shows the details of each database. The number of bases comes from the summary information included in the BLAST search results for each database. Its location varies depending on the output format; on the NCBI's BLAST service it is in the "Search Summary" section of the default HTML result.
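If you have downloaded a database locally, the BLAST+ `blastdbcmd` tool can print a similar summary with its `-info` option. The sketch below wraps that call; it assumes BLAST+ is installed and that the database can be found (for example via the `BLASTDB` environment variable). The two function names are hypothetical helpers of mine, not part of BLAST+.

```python
import subprocess


def blastdbcmd_info_command(db_name: str) -> list:
    """Build the BLAST+ command line that prints a database summary."""
    return ["blastdbcmd", "-db", db_name, "-info"]


def print_db_info(db_name: str) -> None:
    """Run blastdbcmd and print its summary.

    Assumes BLAST+ is installed and a local copy of the database exists.
    """
    result = subprocess.run(
        blastdbcmd_info_command(db_name),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if blastdbcmd fails
    )
    print(result.stdout)


# Example call (needs a local copy of the database, so it is not run here):
# print_db_info("nt")
```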

written 5.2 years ago by hpmcwill1.1k

Thank you. This helps me a lot. I have a follow-up question: what would a Venn diagram look like if we drew one to show the relationship among the nt database, RefSeq genome sequences and RefSeq representative sequences? Thanks.

written 2.9 years ago by xieshaojun0621130