Hi everyone,
If you, like me, work with metagenomic data, you have probably used kaiju2table in the past. It's a tool provided with the Kaiju source code. It produces TSV tables that are easy to handle later on, for example for plotting.
The input data is the classic output format of Kaiju and also of Kraken:
C   A00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA  1578    100 1578:66
C   A00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA  186802  100 0:11 186802:55
C   A00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA  1578    100 1578:66
The input can come from as many files as you need; they are combined into a single file containing percentages, like this:
file            percent             reads    taxon_id  taxon_name
F14_A_R1.s.out  59.89815343509682   7161673  0         Unclassified
F14_A_R1.s.out  1.080231644647389   129157   1301      Streptococcus
F14_A_R1.s.out  2.129960840275143   254667   1350      Enterococcus
F14_A_R1.s.out  1.3252716093792982  158455   1485      Clostridium
F14_A_R1.s.out  6.908532882384414   826013   1578      Lactobacillus
F14_A_R1.s.out  1.880613565083921   224854   204475    Gemmiger
F14_A_R1.s.out  3.163456075511585   378236   572511    Blautia
F14_A_R1.s.out  1.4769725746433902  176593   946234    Flavonifractor
F14_A_R1.s.out  1.0168598167829042  121580   1017280   Pseudoflavonifractor
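To make the relationship between the per-read input and the percentage table concrete, here is a minimal sketch in Python. It is not the code of kaiju2table or kraken2table: the function names count_taxa and to_rows are made up for illustration, it assumes whitespace-separated fields with a numeric taxon ID in the third column, and it tallies reads at whatever taxon they were assigned to, without rolling them up to a rank or looking up taxon names (that would need a taxonomy database).

```python
from collections import Counter


def count_taxa(lines):
    """Tally reads per taxon ID from per-read classification lines.

    Assumed line layout: status (C/U), read ID, taxon ID, then further fields.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, taxon_id = fields[0], int(fields[2])
        # Reads not marked "C" are pooled under taxon ID 0 (unclassified)
        counts[taxon_id if status == "C" else 0] += 1
    return counts


def to_rows(file_name, counts):
    """Turn the tallies into (file, percent, reads, taxon_id) rows."""
    total = sum(counts.values())
    return [
        (file_name, 100.0 * reads / total, reads, taxon_id)
        for taxon_id, reads in counts.most_common()
    ]


if __name__ == "__main__":
    example = [
        "C A00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA 1578 100 1578:66",
        "C A00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA 186802 100 0:11 186802:55",
        "C A00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA 1578 100 1578:66",
    ]
    for row in to_rows("example.out", count_taxa(example)):
        print("\t".join(str(value) for value in row))
```

Running the same two functions on each input file and concatenating the rows gives one combined table, with the file column telling the samples apart.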
As far as I could find, there is no such tool for Kraken2, which is perhaps more widely used than Kaiju. You could, of course, try to use kaiju2table on the Kraken results, but you would have to install Kaiju to get it.
Hence, for my own convenience, I have made a tool called kraken2table that converts the *.out files produced by Kraken2 (mpa format) to *.tsv tables that resemble those produced by kaiju2table.
You can find it here: https://github.com/MatteoSchiavinato/Utilities/blob/master/kraken2table
It depends on:
- ete3
- dask[complete]
The options are quite simple:
usage: kraken2table [-h] -i [INPUT_FILES [INPUT_FILES ...]] -o OUTPUT_FILE
                    [-p THREADS] [-r RANK] [-m MIN_FRAC] [-c MIN_COUNT] [-u]
optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT_FILES [INPUT_FILES ...]], --input-files [INPUT_FILES [INPUT_FILES ...]]
                        Name of input files (SPACE-separated).
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Name of output file.
  -p THREADS, --threads THREADS
                        Number of parallel threads
  -r RANK, --rank RANK  Taxonomic rank to be output, all lowercase (Default:
                        species)
  -m MIN_FRAC, --min-frac MIN_FRAC
                        Number in [0, 100], denoting the minimum required
                        percentage for the taxon (except viruses) to be
                        reported (default: 0.0)
  -c MIN_COUNT, --min-count MIN_COUNT
                        Integer number > 0, denoting the minimum required
                        number of reads for the taxon (except viruses) to be
                        reported (default: 0)
  -u, --exclude-unclassified
                        Unclassified reads are not counted for the total reads
                        when calculating percentages for classified reads.
To be honest, it is not clear what this tool does, and making that clear is perhaps the most important requirement of any software.
Kraken does produce various outputs, and it is not clear how this tool differs from them.
PS: you also say it depends on multiprocessing. Why is that? Your software does not seem to use multiprocessing.
I edited the post to answer your questions. As for multiprocessing, yes, that was my mistake. My first version depended on multiprocessing, but using Dask allowed me to get rid of the multiprocessing module.
Thanks for editing the question.
Now that I understand it better, I will mention that Kraken2 does have an output called the report format, which produces output in the following form:
In addition, the recommended workflow is to process the Kraken2 report with Bracken:
https://ccb.jhu.edu/software/bracken/index.shtml?t=manual
which will create a column-oriented output like so
In addition, the Bracken tool will concatenate several files, thus creating a tabular report across all samples.
Sure. But in my workflow I'm combining many tools, so I needed a consistent format, and I noticed that a tool similar to kaiju2table wasn't available.