Question

Kraken2 database

0

6 months ago

Christopher ▴ 10

Hello everyone. I'm new to bioinformatics and am running analyses with Kraken2, Bracken, and Krona, and I'm already managing to work with these tools. On Ben Langmead's GitHub there are links to some databases, such as EuPathDB46 and MicrobialDB. I'm conducting a metagenomic study to identify pathogenic species of bacteria and fungi in soil, and I would like to know whether I can use these databases for my analyses. Why do I think this should work? My reasoning is that soil samples are more likely to contain soil-associated pathogenic species than human pathogens, just as a metagenome of the intestinal microbiota is expected to contain more human pathogens than any other type.

Is my thinking correct? Can I use these databases to perform the analysis of pathogenic species for soil?

kraken microbialdb database krakendb kraken2 • 2.2k views

ADD COMMENT • link updated 5 months ago by Mathew ▴ 160 • written 6 months ago by Christopher ▴ 10

score 3 · Accepted Answer · 2024-05-11

6 months ago

Mathew ▴ 160

Sorry, what analysis are you trying to perform with these databases?

Based on your question, it appears you are using Kraken2 and Bracken. Here is the Kraken publication, which, I'll note, has been cited over 4,000 times: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9725748/.

Perhaps you have taken soil samples and used Kraken2 to identify bacterial/fungal pathogens in your soil, and Bracken to do microbiome analysis (i.e., estimate the abundance of species in microbiome samples and compute the diversity changes between them).

From this paper:

Kraken 2’s classification sensitivity and specificity highly depend on how (1) complete and (2) accurate the used reference database is.

The paper also describes several pre-built Kraken 2 databases that are available at https://benlangmead.github.io/aws-indexes/k2, which I believe is what you mean by Ben's GitHub?

They state:

The most commonly used database is the standard Kraken 2 database (which includes RefSeq archaea, bacteria, viruses, plasmid complete genomes, UniVec Core, and the most recent human reference genome, GRCh38). ... In addition to the standard database, we provide expanded standard databases with RefSeq protozoa, fungi, and plant genomes.

In that list, I see a collection called "PlusPF", which is the standard database plus RefSeq protozoa and fungi. Which database are you using for your Kraken2 step? Is it the standard database? If so, you might be missing out on identifying fungi. Just glancing at EuPathDB46, I see no obvious reason why it would contain more soil bacterial/fungal genomes than any of the other databases. RefSeq has reference genomes collected from all sources, including soil.
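One way to check this in practice is to classify the same reads against each candidate database and compare the reports. Below is a minimal sketch that only builds the `kraken2` command lines. The database directories and the read file name are hypothetical placeholders, and it assumes the pre-built indexes have already been downloaded and unpacked locally.

```python
import shlex

def kraken2_command(db_dir, reads, prefix, threads=4):
    """Build a kraken2 command line for one database (nothing is executed here)."""
    return [
        "kraken2",
        "--db", db_dir,                  # directory holding the unpacked index
        "--threads", str(threads),
        "--report", prefix + ".k2report",  # per-taxon summary report
        "--output", prefix + ".kraken2",   # per-read classifications
        reads,
    ]

# Hypothetical local paths; replace with wherever the indexes were unpacked.
databases = {
    "pluspf": "/path/to/k2_pluspf",
    "eupathdb46": "/path/to/k2_eupathdb46",
}

commands = [
    kraken2_command(db_dir, "soil_sample.fastq", "soil_sample." + label)
    for label, db_dir in databases.items()
]

for cmd in commands:
    print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Keeping one report per database makes it easy to see which taxa are found only because a given database contains them.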

To perhaps answer your question: think carefully about the database you are using to identify your pathogens, as Kraken2's classification sensitivity depends on it. You could do a literature search to see how other researchers analyze pathogen species in soil. You will rarely be the first person to try something, so reading the literature is a great way to see what has worked for others.

ADD COMMENT • link 6 months ago by Mathew ▴ 160

0

Hello, Mathew. Firstly, thank you for answering my question.

Sorry, what analysis are you trying to perform with these databases?

I actually forgot to mention what my analysis is about. Well, I'm conducting a 16S and ITS metagenomic analysis of soil from agricultural regions. So, I'm analyzing the species of fungi and bacteria present in this soil.

Perhaps you have taken soil samples and used Kraken2 to identify bacterial/fungal pathogens in your soil, and Bracken to do microbiome analysis (i.e., estimate the abundance of species in microbiome samples and compute the diversity changes between them).

Exactly, that's what I'm doing.

The paper also describes several pre-built Kraken 2 databases that are available at https://benlangmead.github.io/aws-indexes/k2, which I believe is what you mean by Ben's GitHub?

For bacteria identification, I'm using the Standard-16 database (I believe it's the most comprehensive because it's the heaviest file). For fungi identification, I'm using the Standard_PlusPF database, and for pathogenic fungi identification, I'm using EuPathDB46.

To perhaps answer your question: think carefully about the database you are using to identify your pathogens, as Kraken2's classification sensitivity depends on it. You could do a literature search to see how other researchers analyze pathogen species in soil. You will rarely be the first person to try something, so reading the literature is a great way to see what has worked for others.

I've searched the literature extensively for information about databases of pathogenic bacterial species, and even so, it's been very difficult to find any. I read in an article from 2020 (I don't remember which one now) that such databases are indeed very hard to find. Is that true?

I'd like to ask another question here, if you allow me. Yesterday, I ran some analyses and noticed that in my fungal analyses, even though I used databases only for fungi, some species and genera of bacteria appeared. Is this correct? I don't understand why bacteria appear while the databases are only for eukaryotes.

I'd like to thank you again, Mathew, for kindly answering my question, and thank you very much for the article. I'll read it more carefully later.

ADD REPLY • link 5 months ago by Christopher ▴ 10

1

From a quick search, I don't see any databases containing only pathogenic bacterial genomes. I would imagine that the Standard-16 database would give you both pathogenic and nonpathogenic strains. I don't know how many taxa it actually gave you, i.e., how feasible it would be to go through them and note which ones are pathogenic and which aren't.
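If the list of hits is long, the "go through and note which ones are pathogenic" step can be scripted. Here is a minimal sketch, assuming the standard tab-separated Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, indented name). The pathogen list is something you would have to curate yourself from the literature; the names and read counts below are made-up toy values for illustration only.

```python
def species_rows(report_text):
    """Return (species name, clade read count) for species-level report rows."""
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Index from the end so reports with extra minimizer columns also work.
        rank, name = fields[-3], fields[-1].strip()
        if rank == "S":
            rows.append((name, int(fields[1])))
    return rows

def flag_pathogens(report_text, pathogen_names):
    """Keep only species whose name is in a user-curated pathogen list."""
    wanted = {name.lower() for name in pathogen_names}
    return [(n, reads) for n, reads in species_rows(report_text) if n.lower() in wanted]

# Toy report with invented read counts.
toy_report = "\n".join([
    " 50.00\t500\t0\tD\t2\t  Bacteria",
    " 30.00\t300\t300\tS\t317\t          Pseudomonas syringae",
    " 20.00\t200\t200\tS\t1423\t          Bacillus subtilis",
    " 10.00\t100\t100\tS\t5507\t          Fusarium oxysporum",
])

# Hypothetical curated list; not an authoritative pathogen catalogue.
my_pathogens = ["Pseudomonas syringae", "Fusarium oxysporum"]

hits = flag_pathogens(toy_report, my_pathogens)
print(hits)
```

Matching on exact species names is deliberately simple; matching on taxids would be more robust if your curated list has them.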

If Standard_PlusPF gave you bacterial genera in your fungal analysis, this would be expected, because that database just adds the fungal genomes to the standard genomes, which include bacteria. As for EuPathDB46, it is interesting that bacterial genera appear, given how the database is described:

Eukaryotic pathogen, vector, & host informatics resources.

I would expect to see maybe protists/fungi/yeast. It could be that their database also has some pathogenic bacteria in it (on purpose or by accident).
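To see how much of each run is actually landing on bacteria versus eukaryotes, you could tally the domain-level rows of each report. This is a minimal sketch, assuming the standard Kraken2 report layout in which domains carry the rank code "D"; the toy report and its read counts are invented for illustration.

```python
def reads_per_domain(report_text):
    """Map each domain-level taxon (rank code 'D') to its clade read count."""
    totals = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Rank code and name are taken from the end of the row so that
        # reports with extra minimizer columns are handled too.
        if fields[-3] == "D":
            totals[fields[-1].strip()] = int(fields[1])
    return totals

# Toy report with invented numbers.
toy_report = "\n".join([
    "  0.00\t0\t0\tU\t0\tunclassified",
    "100.00\t1000\t0\tR\t1\troot",
    " 12.00\t120\t5\tD\t2\t    Bacteria",
    " 88.00\t880\t10\tD\t2759\t    Eukaryota",
    " 40.00\t400\t400\tS\t5507\t          Fusarium oxysporum",
])

domains = reads_per_domain(toy_report)
for name, reads in domains.items():
    print(name, reads)
```

If a eukaryote-only database still reports a noticeable share of bacterial reads, that points at bacterial sequence being present in the database itself.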

ADD REPLY • link 5 months ago by Mathew ▴ 160