I would like to be able to create my own custom local BLAST database, as this may be relevant in many different situations in bioinformatics. In this case, I hope to make a database containing the latest versions of all the bacterial genomes found in RefSeq. For starters, I have downloaded bacterial genomes (assemblies) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria, using the information in "assembly_summary.txt" to fetch only the latest genome versions. As a result, I now have almost 104,000 files (one per bacterial genome), each containing one or more contigs. So far, so good.
Each contig within a genome has a header containing the NCBI accession number plus a short description, e.g.:
Genome (file) 1:
>NZ_NMDP01000102.1 Escherichia coli strain MOD1-EC6062
>NZ_NMDP01000103.1 Escherichia coli strain MOD1-EC6062
Genome (file) 2:
>NZ_NOBY01000102.1 Escherichia coli strain MOD1-EC5816
>NZ_NOBY01000115.1 Escherichia coli strain MOD1-EC5816
I now want to associate all genomes with a taxonomy (taxid?), as I understand this is important in many applications. For example, by running a BLAST search against my local database, I want to be able to quickly determine which bacterium my query sequence originates from.
My questions are therefore:
1. How do I find the taxon ID for all the bacterial genomes in question?
(Note: These are genomes from ../genomes/refseq/bacteria, not ..refseq/release/bacteria.)
2. How do I incorporate that information into my genome files and/or final local database?
I suspect I first have to link up the NCBI accession number in the headers to a taxon ID in some way, but I'm not sure how to do that, or in what format it should be.
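To make the kind of link I have in mind concrete, here is a minimal Python sketch. It rests on several assumptions that may not hold: that "assembly_summary.txt" has a header line naming the columns assembly_accession and taxid, that each genome file still has its original NCBI name starting with the assembly accession, and that a plain "accession taxid" line per contig is a usable format (as far as I understand, this is what the -taxid_map option of makeblastdb reads when used together with -parse_seqids). The function names and the example accession GCF_000000000.1 are made up for illustration.

```python
def parse_assembly_summary(text):
    """Return {assembly_accession: taxid} from assembly_summary.txt content.

    Assumes a '#'-prefixed, tab-separated header line that names the
    columns 'assembly_accession' and 'taxid'.
    """
    taxids = {}
    header = None
    for line in text.splitlines():
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        taxids[row["assembly_accession"]] = row["taxid"]
    return taxids


def assembly_from_filename(filename):
    """Assumes NCBI-style names, e.g. GCF_xxxxxxxxx.v_<asm_name>_genomic.fna."""
    return "_".join(filename.split("_")[:2])


def contig_accessions(fasta_text):
    """Yield the accession (first word after '>') of every FASTA header."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            yield line[1:].split()[0]


def taxid_map_lines(genomes, taxids):
    """Yield one 'accession taxid' line per contig.

    genomes: {filename: FASTA text}; taxids: output of parse_assembly_summary.
    """
    for filename, fasta_text in genomes.items():
        taxid = taxids[assembly_from_filename(filename)]
        for accession in contig_accessions(fasta_text):
            yield f"{accession} {taxid}"


if __name__ == "__main__":
    # Made-up summary row and file name; 562 is the species taxid of E. coli.
    summary = (
        "# assembly_accession\tbioproject\ttaxid\n"
        "GCF_000000000.1\tPRJNA000000\t562\n"
    )
    genomes = {
        "GCF_000000000.1_ASM0v1_genomic.fna": (
            ">NZ_NMDP01000102.1 Escherichia coli strain MOD1-EC6062\nACGT\n"
            ">NZ_NMDP01000103.1 Escherichia coli strain MOD1-EC6062\nTTGA\n"
        )
    }
    for line in taxid_map_lines(genomes, parse_assembly_summary(summary)):
        print(line)
```

For the real data, the dictionary of FASTA texts would have to be replaced by streaming over the ~104,000 files on disk, but the accession-to-taxid logic would stay the same. Whether this is the right format and the right approach is exactly what I am unsure about.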
All answers are highly appreciated! :)
Even Sannes Riiser,
PhD candidate, University of Oslo, Norway