Question: What sequences are in the precomputed human_genomic database?
nkinney06 wrote:

I recently installed BLAST and downloaded the precomputed human_genomic.*tar.gz database available here:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

I tested my installation with the following FASTA file:

cat test_query.fa 
>chr13:83987454-83987503
GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT

When I BLAST this query against my local database, I see the primary assembly but also many additional hits:

>NC_000013.11 Homo sapiens chromosome 13, GRCh38.p7 Primary Assembly <- matches my test query
Query  1         GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
Sbjct  83987454  GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  83987503

>NT_024524.15 Homo sapiens chromosome 13 genomic scaffold, GRCh38.p7 Primary 
Query  1         GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
Sbjct  65579348  GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  65579397

>GL583019.1 Homo sapiens unplaced genomic scaffold scaffold_39, whole genome 
Query  1       GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
Sbjct  731735  GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  731686

>Lots more results...

My question: what is the source of all the additional sequences in this BLAST database?

I have looked at the README (available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README) but the information there is not very thorough. Is there a complete list of what's in this database? Thanks!

blast • 181 views
modified 12 weeks ago by genomax • written 12 weeks ago by nkinney06
genomax wrote:

See if this helps:

[Image: "Capture", a screenshot that is not preserved in this text]

written 12 weeks ago by genomax

This is better than the README file, but when I run BLAST it says:

Effective search space used: 1344767968614
  Database: NCBI genome chromosomes - human
    Posted date:  Jul 19, 2017  11:08 PM
  Number of letters in database: 64,036,671,579
  Number of sequences in database:  3,505

Perhaps the database also includes some older assemblies and unplaced contigs?
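
A quick sanity check on the numbers reported above, as a minimal sketch: it divides the "Number of letters in database" value by an assumed haploid human reference size of roughly 3.1 billion bases. That genome size is an approximation supplied here for illustration and does not come from the BLAST output, and fold_coverage is a hypothetical helper name.

```shell
# Hypothetical helper: how many "genome equivalents" does a letter count represent?
#   $1  letter count as printed by BLAST (thousands separators allowed)
#   $2  assumed genome size in bases (an assumption, not read from the database)
fold_coverage() {
  letters=$(printf '%s' "$1" | tr -d ',')   # strip the commas BLAST prints
  awk -v n="$letters" -v g="$2" 'BEGIN { printf "%.1f\n", n / g }'
}
```

Calling fold_coverage 64,036,671,579 3100000000 gives about 20.7, so under that assumption the database holds far more sequence than a single copy of the primary assembly.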

written 12 weeks ago by nkinney06

Take a look to see what is included using this command:

blastdbcmd -db human_genomic -entry all -outfmt "%i %t"

That said, NCBI offers something different on its human genome BLAST page, which is where I captured the screenshot above.
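
The listing from the blastdbcmd command above can run to thousands of lines, so a summary may be easier to read. The following is a minimal sketch, not an official NCBI tool: count_by_prefix is a hypothetical helper that groups entries by accession prefix (for example NC_, NT_, or GL, as seen in the hits in the question). It assumes each input line starts with the sequence ID followed by whitespace and the title.

```shell
# Hypothetical helper: tally database entries by accession prefix.
# Reads "ID title" lines on stdin and prints "prefix<TAB>count", sorted by prefix.
count_by_prefix() {
  awk '
    NF == 0 { next }    # skip blank lines
    {
      id = $1
      # Some BLAST+ versions print bar-delimited IDs such as "ref|NC_000013.11|";
      # keep the last non-empty piece so both styles are handled.
      n = split(id, parts, "|")
      for (i = n; i >= 1; i--) {
        if (parts[i] != "") { id = parts[i]; break }
      }
      # The prefix is the leading run of letters plus an optional underscore.
      if (match(id, /^[A-Za-z]+_?/)) {
        prefix = substr(id, RSTART, RLENGTH)
      } else {
        prefix = "(other)"
      }
      count[prefix]++
    }
    END {
      for (p in count) printf "%s\t%d\n", p, count[p]
    }
  ' | LC_ALL=C sort
}
```

One way to use it would be: blastdbcmd -db human_genomic -entry all -outfmt "%i %t" | count_by_prefix. Grouping by prefix only separates accession families; it does not by itself say which assembly each sequence belongs to, so the titles are still worth reading.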

modified 12 weeks ago • written 12 weeks ago by genomax