Question

Generate local blast database with RefSeq bacteria AND taxonomy


6.3 years ago

even.s.riiser ▴ 10

Dear all,

I would like to create my own custom local BLAST database, as this is useful in many different situations in bioinformatics. In this case, I want a database containing the latest versions of all bacterial genomes in RefSeq. To start, I downloaded the bacterial genomes (assemblies) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria, using the information in "assembly_summary.txt" to fetch only the latest version of each genome. As a result, I now have almost 104,000 files (one per bacterial genome), each containing one or more contigs. So far, so good.
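To make the filtering step concrete, here is a minimal Python sketch of roughly what I mean by "latest versions only". It assumes that the column names are on a comment line containing `assembly_accession`, and that the file has columns named `version_status`, `taxid` and `ftp_path`. The function name `latest_assemblies` is just something I made up for illustration.

```python
def latest_assemblies(summary_text):
    """Return {assembly_accession: (taxid, ftp_path)} for rows marked 'latest'.

    Assumes a tab-separated assembly_summary.txt whose column names sit on a
    comment line (starting with '#') that mentions 'assembly_accession'.
    """
    header = None
    latest = {}
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Pick up the column names; other comment lines are skipped.
            if "assembly_accession" in line:
                header = line.lstrip("#").strip().split("\t")
            continue
        if header is None or not line.strip():
            continue
        record = dict(zip(header, line.split("\t")))
        # Keep only the current version of each assembly.
        if record.get("version_status") == "latest":
            latest[record["assembly_accession"]] = (
                record["taxid"],
                record["ftp_path"],
            )
    return latest
```

Called on the contents of assembly_summary.txt, this should give one entry per current assembly, and the `taxid` column already comes along with each accession.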

Each contig within a genome has a header containing the NCBI accession number followed by a description, e.g.:

Genome (file) 1:

>NZ_NMDP01000102.1 Escherichia coli strain MOD1-EC6062

>NZ_NMDP01000103.1 Escherichia coli strain MOD1-EC6062

Genome (file) 2:

>NZ_NOBY01000102.1 Escherichia coli strain MOD1-EC5816

>NZ_NOBY01000115.1 Escherichia coli strain MOD1-EC5816

etc...

I now want to associate every genome with its taxonomy ID (taxid), as I understand this is important in many applications. For example, by running BLAST against my local database, I want to be able to quickly determine which bacterium my query sequence originates from.

My questions are therefore:

1. How do I find the taxon ID for all the bacterial genomes in question?

(Note: These are genomes from ../genomes/refseq/bacteria, not ../refseq/release/bacteria.)

2. How do I incorporate that information into my genome files and/or final local database?

I suspect I first have to link up the NCBI accession number in the headers to a taxon ID in some way, but I'm not sure how to do that, or in what format it should be.
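My current guess at what that linking could look like is sketched below in Python. It is only a sketch under assumptions: that the genome files keep NCBI-style names such as `GCF_000005845.2_ASM584v2_genomic.fna`, so the assembly accession can be read from the file name, and that the taxid for each assembly has already been looked up elsewhere. Both function names are hypothetical.

```python
import os


def assembly_accession_from_filename(path):
    """Extract e.g. 'GCF_000005845.2' from 'GCF_000005845.2_ASM584v2_genomic.fna'.

    Assumes the NCBI naming scheme, where the accession is the first two
    underscore-separated fields of the file name.
    """
    name = os.path.basename(path)
    return "_".join(name.split("_")[:2])


def taxid_map_lines(fasta_lines, taxid):
    """Yield one 'sequence_id taxid' line per FASTA header.

    The sequence ID is the first whitespace-delimited token after '>'.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip malformed headers with no ID
                yield f"{fields[0]} {taxid}"
```

If I read the makeblastdb help correctly, the resulting lines could be written to a text file and passed along with the concatenated FASTA, something like `makeblastdb -in all_genomes.fna -dbtype nucl -parse_seqids -taxid_map taxid_map.txt` (the file names here are placeholders). The `staxids` field in a tabular `-outfmt` should then report the taxid of each hit, but I have not verified this on a database of this size.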

All answers are highly appreciated! :)

Kind regards,

Even Sannes Riiser,

PhD candidate, University of Oslo, Norway

blast refseq taxid taxonomy • 2.4k views
