10 months ago
jobie1 ▴ 30

I would like to know more about what reference database is being used in these standard operating procedures for metagenomic analysis of microbiome samples: https://github.com/merckey/microbiome_helper/wiki/Metagenomics-Standard-Operating-Procedure-v3

Here is what the procedure lists for using Kraken2 to obtain "raw" taxonomy profiles:

*copy the Kraken2 database to the mounted ramdisk in your scratch location (on SSD drive) to put the database directly into RAM, which speeds up processing enormously as the database normally has to be read for each sample:

cp -r /home/shared/Kraken2.0.8_Bracken150mer_RefSeqCompleteV93

It says the database is about 800 GB. My question is: which NCBI databases does it consist of, exactly?

The Kraken2 manual names some standard genome sets that can be used to create a custom database, so I found this list of what it might be made up of:

  • bacteria: RefSeq complete bacterial/archaeal genomes
  • plasmids: RefSeq plasmid sequences
  • viruses: RefSeq complete viral genomes
  • human: GRCh38 human genome

Would it be made up of any of these? Or all of them?

This is a great best-practice resource. A real "SOP" wouldn't give you much flexibility; it defines a standard operating procedure for one specific use case (this would never pass QM ;-).

Nitpicking aside, they also mention:

This can be substituted for one of the smaller pre-compiled Kraken2 databases

This might be more suitable. RefSeq Complete contains everything in RefSeq, including everything you mentioned plus everything you didn't. IMO, that might be a bit much to start with, so I'd start with a smaller database. If a lot of your reads remain unassigned, RefSeq Complete will still be an option once you have gathered some experience.
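If you would rather assemble a smaller database yourself than download a pre-compiled one, the rough shape with `kraken2-build` is sketched below. This is only a dry run that prints the commands. The database name `my_kraken2_db` and the choice of libraries are my own assumptions, not what the SOP authors used. Also note that, as far as I know, Kraken2 spells the library names `plasmid` and `viral` and keeps `archaea` separate from `bacteria`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the kraken2-build commands for a small custom database.
# DBNAME and LIBRARIES are illustrative assumptions, not the SOP's actual setup.
DBNAME="${DBNAME:-my_kraken2_db}"
LIBRARIES="${LIBRARIES:-archaea bacteria plasmid viral human}"

# Emit one command per line: taxonomy first, then each library, then the build.
kraken2_build_plan() {
    echo "kraken2-build --download-taxonomy --db $DBNAME"
    for lib in $LIBRARIES; do
        echo "kraken2-build --download-library $lib --db $DBNAME"
    done
    echo "kraken2-build --build --db $DBNAME"
}

kraken2_build_plan
```

Pipe the output to `bash` once you are happy with the plan (the downloads are large, so check disk space first). Afterwards, `kraken2-inspect --db my_kraken2_db` should list which taxa actually ended up in the database, which is also a way to check what a pre-built database contains.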

Thank you! This was very helpful. Just curious: what do you mean when you say it would never pass QM?

At companies with regulated processes, SOPs and adherence to them are overseen by a QM (Quality Management) department. There, an SOP is the gold-standard way to do something: there mustn't be alternative routes in the protocol, and deviations need to be documented. A "best practice" protocol can show alternatives where needed.

PS: Best practice in this forum would be to move your answer below my comment. This makes it easier for others to follow the conversation flow.

I think this is the relevant sentence from the link you posted above:

NCBI RefSeq Complete database for identification.

That would be ~800 GB.


