Question

Kraken2 nucleotide custom database

2.2 years ago

Natalia • 0

Hi,

I am using Kraken2 for taxonomic classification (binning) of my shotgun eDNA data. My custom database contains only aquatic organisms. It currently holds genome assemblies, and I am hoping to add individual nucleotide sequences to it as well.

Is it a good idea to pull down all nucleotide sequences for each organism from NCBI to create this database? Or is this going to add a bunch of untrustworthy sequences to my database that won't significantly improve the binning?

This is the guideline I have thought to follow so far: for organisms that already have genome assemblies, download only mitochondrial sequences, since the full set of genomic nucleotide records for these organisms will be huge and will duplicate sequences already in the database.

Does this make sense? Is it worth downloading all these nucleotide sequences to improve Kraken2 results?
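If duplication is the main worry, one option is to drop exact duplicate sequences from the combined FASTA before building the database. The following is a minimal sketch only: the helper names `parse_fasta` and `drop_exact_duplicates` are hypothetical, and it catches only byte-identical sequences (ignoring case), not a mitochondrial record that is merely contained inside a larger assembly scaffold.

```python
import hashlib


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for header, seq in records:
        digest = hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((header, seq))
    return kept
```

Note that this collapses identical sequences even when they come from different organisms, so in practice the key would probably need to include the taxon as well.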

ncbi kraken2 metagenomics • 1.6k views

ADD COMMENT • link 2.2 years ago by Natalia • 0

Genome assemblies are made up of nucleotide sequences, so I am unsure what you mean by adding nucleotide sequences in your first paragraph.

Personally, I would not download all the sequences from each organism in your list. That seems like overkill, and as you have pointed out, there are some poor-quality sequences present that can harm inferences. However, you could create filtering thresholds for this data (e.g., contiguity of assembly, assembly pipeline, date, population sampled, species range, number of phylogenetically similar taxa targeted), but this will take time and can be a lot of work, albeit worthwhile.
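As an illustration of what a contiguity threshold could look like, here is a rough sketch that computes N50 from a list of contig lengths and applies cut-offs. The function names and the default values (`min_n50=10_000`, `max_contigs=50_000`) are assumptions for the example, not recommended settings; sensible values depend on the taxa involved.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def passes_thresholds(contig_lengths, min_n50=10_000, max_contigs=50_000):
    """Hypothetical filter: keep an assembly only if it is contiguous enough
    and not fragmented into too many pieces."""
    if not contig_lengths:
        return False
    return n50(contig_lengths) >= min_n50 and len(contig_lengths) <= max_contigs
```

The other criteria listed above (pipeline, date, population sampled) would need metadata rather than sequence lengths, so they are left out of this sketch.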

Finally, since you say you are using shotgun eDNA data, I don't understand why you wouldn't also include genomic sequences. If anything, adding more genome assemblies would only increase the number of informative regions Kraken2 can assign to the species level. If you were using a locus-based approach, like 16S rRNA, then taking only mtDNA sequences could make sense depending on the target locus.

ADD REPLY • link 2.2 years ago by dthorbur ★ 3.1k

Thank you for your reply! Sorry, I don't think I was clear - I am using genomic sequences already. I was mainly wondering whether adding individual nucleotide sequences for those organisms that don't have assembled genomes would be worthwhile, or if they could just clutter up the database with junk (I have many organisms in my database without assembled genomic sequences).

The filtering thresholds sound useful, although I agree that it would be quite a lot of work.

ADD REPLY • link 2.2 years ago by Natalia • 0

That's hard to say without seeing what kind of assembly you want to add to the database. Generally, I think adding assembled genomes, even if they are only assembled to the contig level, is worthwhile to give a more representative spread of your target taxa.

That said, what is more important in your study: an even spread of genomes at the cost of computational overhead, or fast processing at the cost of an uneven taxonomic spread because all poor-quality assemblies are ignored? I suspect the former, but don't know for sure.

ADD REPLY • link 2.2 years ago by dthorbur ★ 3.1k

Yes, I am adding all available assembled genomes. What I am unsure about is whether adding all available nucleotide sequences (https://www.ncbi.nlm.nih.gov/nucleotide/) belonging to organisms without assembled genomes will actually improve my classification coverage in Kraken2 (I am working with a microbiome that is still poorly characterised, so my main goal is to improve this).
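For anyone trying this, individual records need a taxon assignment before Kraken2 can use them. Per the Kraken2 manual, a sequence ID must either contain an NCBI accession that the taxonomy maps can resolve, or carry an explicit `kraken:taxid|<id>` tag. A small sketch of the explicit tagging follows; `tag_header` and `tag_fasta` are hypothetical helper names, and the assumption is that all records in one file belong to a single known taxon.

```python
def tag_header(header, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID (the first
    whitespace-delimited token), leaving any description untouched."""
    seq_id, _, description = header.partition(" ")
    tagged = f"{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged


def tag_fasta(text, taxid):
    """Return FASTA text with every header line tagged with the given taxid."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + tag_header(line[1:], taxid))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The tagged file can then be added with `kraken2-build --add-to-library` before building, which makes it fairly cheap to build the database with and without the extra records and compare the fraction of classified reads.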

I realise this will be hard to determine without trying it, but I was wondering whether others have done something similar to improve their Kraken2 output.

ADD REPLY • link 2.2 years ago by Natalia • 0