I recently installed BLAST and downloaded the precomputed human_genomic.*tar.gz database available here:
I tested my installation with the following fasta file:
    cat test_query.fa
    >chr13:83987454-83987503
    GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT
When I BLAST this query against my local database, I see the primary assembly, but also many additional hits:
    >NC_000013.11 Homo sapiens chromosome 13, GRCh38.p7 Primary Assembly   <- matches my test query
    Query  1         GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
    Sbjct  83987454  GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  83987503

    >NT_024524.15 Homo sapiens chromosome 13 genomic scaffold, GRCh38.p7 Primary
    Query  1         GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
    Sbjct  65579348  GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  65579397

    >GL583019.1 Homo sapiens unplaced genomic scaffold scaffold_39, whole genome
    Query  1         GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  50
    Sbjct  731735    GCTGGGTGGTCAGCGCTGGTTCCATGGGCAGTAATGATTTCCTCTGTTTT  731686

    >Lots more results...
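The hits above carry different accession prefixes (NC_, NT_, GL). As a rough way to see which kinds of records turn up, here is a minimal Python sketch that tallies hit deflines by prefix. The prefix labels are my own reading of NCBI accession conventions (an assumption, not something stated in the README), and `classify` and `tally_hits` are hypothetical helper names.

```python
from collections import Counter

# Assumed meaning of accession prefixes; my own reading of NCBI
# conventions, not taken from the database README.
PREFIX_LABELS = {
    "NC_": "RefSeq chromosome",
    "NT_": "RefSeq genomic scaffold/contig",
    "NW_": "RefSeq WGS scaffold/contig",
}


def classify(accession):
    """Return a coarse label for an accession based on its prefix."""
    for prefix, label in PREFIX_LABELS.items():
        if accession.startswith(prefix):
            return label
    return "other (e.g. GenBank)"


def tally_hits(blast_text):
    """Count subject deflines (lines starting with '>') per label."""
    counts = Counter()
    for line in blast_text.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if fields:  # skip a bare '>' with nothing after it
            counts[classify(fields[0])] += 1
    return counts


# The three deflines shown in the output above.
example = """\
>NC_000013.11 Homo sapiens chromosome 13, GRCh38.p7 Primary Assembly
>NT_024524.15 Homo sapiens chromosome 13 genomic scaffold, GRCh38.p7 Primary
>GL583019.1 Homo sapiens unplaced genomic scaffold scaffold_39, whole genome
"""

for label, n in tally_hits(example).items():
    print(f"{label}: {n}")
```

Run on the full BLAST report, this would only summarize which prefixes appear among the hits; it does not say where the database itself was assembled from.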
My question: what is the source of all the additional sequences in this BLAST database?
I have looked at the README (available at ftp://ftp.ncbi.nlm.nih.gov/blast/db/README) but the information there is not very thorough. Is there a complete list of what's in this database? Thanks!