I have a question about the Kraken 2 classifier; please let me know if I am thinking about this incorrectly. To build our own custom Kraken 2 database, we need to do the following (Here is the reference):
1) Install a taxonomy (NCBI)
2) Install one or more reference libraries (we can also include our own sequences in this step using FASTA files)
3) Build the database using the kraken2-build command
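As a rough sketch of the three steps above, the snippet below only assembles and prints the kraken2-build invocations; it does not run them. The options shown (--download-taxonomy, --add-to-library, --build, --db) are the ones described in the Kraken 2 manual, while the database name "my_custom_db" and the file "my_strains.fa" are placeholders I made up for illustration.

```python
import shlex
from typing import List


def kraken2_build_commands(db: str, custom_fastas: List[str]) -> List[List[str]]:
    """Return the kraken2-build command lines for a custom database.

    Nothing is executed here; the caller decides whether to run them.
    """
    # Step 1: install the NCBI taxonomy.
    commands = [["kraken2-build", "--download-taxonomy", "--db", db]]
    # Step 2: add each of our own FASTA files to the library.
    for fasta in custom_fastas:
        commands.append(["kraken2-build", "--add-to-library", fasta, "--db", db])
    # Step 3: build the database from the taxonomy and the library.
    commands.append(["kraken2-build", "--build", "--db", db])
    return commands


if __name__ == "__main__":
    # "my_custom_db" and "my_strains.fa" are hypothetical names.
    for cmd in kraken2_build_commands("my_custom_db", ["my_strains.fa"]):
        print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to pass them to subprocess.run later without worrying about shell quoting.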
To add other genomes in step 2, the documentation says the sequences must be in FASTA or multi-FASTA files, and each sequence ID in the file(s) must contain either an NCBI accession number or an explicit assignment to a taxid. If I had my own database with a column of strain sequences, a column of strain names, and a column with the matching NCBI accession numbers, would I be able to add these sequences in step 2 by making my own FASTA file from this information?
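A minimal sketch of turning such a table into a FASTA file could look like the following. It assumes each row has a strain name, an accession, and a sequence; the StrainRecord type, the accession "ACC0001.1", and the taxid 12345 are invented placeholders, not real identifiers. The "kraken:taxid|N" form in the header follows the explicit taxid assignment described in the Kraken 2 manual; when no taxid is given, the header carries only the accession.

```python
from typing import Iterable, NamedTuple, Optional


class StrainRecord(NamedTuple):
    """One row of the (hypothetical) strain table."""
    strain_name: str
    accession: str
    sequence: str
    taxid: Optional[int] = None  # set this to assign a taxid explicitly


def fasta_header(rec: StrainRecord) -> str:
    """Build a header whose sequence ID carries the accession or a taxid."""
    if rec.taxid is not None:
        # Explicit assignment: the ID contains kraken:taxid|<number>.
        return f">{rec.accession}|kraken:taxid|{rec.taxid} {rec.strain_name}"
    # Otherwise rely on the accession being resolvable to a taxid.
    return f">{rec.accession} {rec.strain_name}"


def to_fasta(records: Iterable[StrainRecord], width: int = 60) -> str:
    """Render records as multi-FASTA text, wrapping sequences at `width`."""
    lines = []
    for rec in records:
        lines.append(fasta_header(rec))
        seq = rec.sequence.upper()
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder rows; real accessions and taxids would come from the table.
    rows = [
        StrainRecord("strain A", "ACC0001.1", "acgtacgtacgt"),
        StrainRecord("strain B", "ACC0002.1", "ttgacca", taxid=12345),
    ]
    print(to_fasta(rows), end="")
```

The resulting text can be written to a file and passed to step 2 as one of the FASTA files to add.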
Would it then be possible to get Kraken 2 to classify reads that match these strains from our own custom database? (The Kraken 2 documentation says that it does not classify reads at the strain level.)
I suppose I'm more confused about why some tools only allow classification to the species level when we can make our own database that provides sequences at the strain level. Is it because the classifier is not able to look up the strain information from NCBI in order to classify the reads properly? Please let me know if there is any gap in my understanding.
UPDATE: Kraken 2 allows strain-level classification if you use your own custom database, as long as the k-mers are unique enough to distinguish the strains.
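To illustrate what "unique enough" means, here is a toy sketch that lists the k-mers found in one strain but in no other. It is only an illustration under simplifying assumptions (plain substrings, tiny made-up sequences, a small k) and does not reproduce how Kraken 2 actually indexes or assigns k-mers; the function names are my own.

```python
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """For each strain, return the k-mers that occur in no other strain."""
    per_strain = {name: kmers(seq, k) for name, seq in genomes.items()}
    result = {}
    for name, own in per_strain.items():
        # Union of the k-mers from every other strain.
        others = set().union(
            *(ks for other, ks in per_strain.items() if other != name)
        )
        result[name] = own - others
    return result


if __name__ == "__main__":
    # Made-up sequences that differ only in the last base.
    toy = {"strain_A": "AAAAC", "strain_B": "AAAAG"}
    for strain, ks in unique_kmers(toy, k=3).items():
        print(strain, sorted(ks))
```

If two strains share every k-mer, each ends up with an empty set here, which corresponds to the case where reads cannot be told apart at the strain level.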