Tool: kraken2table, an analogue of kaiju2table
Macspider (Vienna - BOKU) wrote, 7 weeks ago:

Hi everyone,

If you, like me, work with metagenomic data, you have probably used kaiju2table in the past. It's a tool provided with the Kaiju source code. It produces TSV tables that are easy to handle later on, for example for plotting.

The input data is the classic output format of Kaiju and also of Kraken:

C   A00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA  1578    100 1578:66
C   A00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA  186802  100 0:11 186802:55
C   A00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA  1578    100 1578:66

The input can come from as many files as you need; they are combined into one file containing percentages, like this:

file            percent             reads    taxon_id  taxon_name
F14_A_R1.s.out  59.89815343509682   7161673  0         Unclassified
F14_A_R1.s.out  1.080231644647389   129157   1301      Streptococcus
F14_A_R1.s.out  2.129960840275143   254667   1350      Enterococcus
F14_A_R1.s.out  1.3252716093792982  158455   1485      Clostridium
F14_A_R1.s.out  6.908532882384414   826013   1578      Lactobacillus
F14_A_R1.s.out  1.880613565083921   224854   204475    Gemmiger
F14_A_R1.s.out  3.163456075511585   378236   572511    Blautia
F14_A_R1.s.out  1.4769725746433902  176593   946234    Flavonifractor
F14_A_R1.s.out  1.0168598167829042  121580   1017280   Pseudoflavonifractor
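
As a rough illustration of the conversion described above (not the actual kraken2table code), the counting step can be sketched in plain Python. The function names are hypothetical, and the sketch only counts reads at the taxon each one was assigned to; rolling counts up to a chosen rank and looking up taxon names are left out.

```python
from collections import Counter


def parse_kraken_line(line):
    """Return (taxon_id, classified) for one per-read output line."""
    fields = line.split()
    classified = fields[0] == "C"   # column 1: C (classified) or U (unclassified)
    taxon_id = int(fields[2])       # column 3: taxon ID assigned to the read
    return taxon_id, classified


def summarize(file_name, lines):
    """Return rows of (file, percent, reads, taxon_id), sorted by taxon ID."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        taxon_id, classified = parse_kraken_line(line)
        # Unclassified reads are pooled under taxon ID 0, as in the table above.
        counts[taxon_id if classified else 0] += 1
    total = sum(counts.values())
    return [(file_name, 100.0 * reads / total, reads, taxon_id)
            for taxon_id, reads in sorted(counts.items())]


# The three example reads shown earlier in this post.
demo_lines = [
    "C\tA00700:50:HF7LGDRXX:1:1101:1000:10144#CCGGCATCATCTACGA\t1578\t100\t1578:66",
    "C\tA00700:50:HF7LGDRXX:1:1101:1000:19225#CCGGCATCATCTACGA\t186802\t100\t0:11 186802:55",
    "C\tA00700:50:HF7LGDRXX:1:1101:1000:23234#CCGGCATCATCTACGA\t1578\t100\t1578:66",
]

for row in summarize("demo.out", demo_lines):
    print(*row, sep="\t")
```

With these three reads, two fall under taxon 1578 and one under 186802, so the percentages are two thirds and one third.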

As far as I could find, there is no such tool for Kraken2, which is perhaps more widely used than Kaiju. You could, of course, try to use kaiju2table on the Kraken results, but you would have to install Kaiju just to get it.

Hence, for my own convenience I have made a tool called kraken2table that converts the *.out files produced by Kraken2 (mpa format) to *.tsv tables that resemble those produced by kaiju2table.

You can find it here:

https://github.com/MatteoSchiavinato/Utilities/blob/master/kraken2table

It depends on:

  • ete3
  • dask[complete]

The options are quite simple:

usage: kraken2table [-h] -i [INPUT_FILES [INPUT_FILES ...]] -o OUTPUT_FILE
                    [-p THREADS] [-r RANK] [-m MIN_FRAC] [-c MIN_COUNT] [-u]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT_FILES [INPUT_FILES ...]], --input-files [INPUT_FILES [INPUT_FILES ...]]
                        Name of input files (SPACE-separated).
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Name of output file.
  -p THREADS, --threads THREADS
                        Number of parallel threads
  -r RANK, --rank RANK  Taxonomic rank to be output, all lowercase (Default:
                        species)
  -m MIN_FRAC, --min-frac MIN_FRAC
                        Number in [0, 100], denoting the minimum required
                        percentage for the taxon (except viruses) to be
                        reported (default: 0.0)
  -c MIN_COUNT, --min-count MIN_COUNT
                        Integer number > 0, denoting the minimum required
                        number of reads for the taxon (except viruses) to be
                        reported (default: 0)
  -u, --exclude-unclassified
                        Unclassified reads are not counted for the total reads
                        when calculating percentages for classified reads.
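
To make the effect of -m, -c and -u concrete, here is a minimal sketch of one plausible reading of the help text. It is not taken from the kraken2table source: the function name is hypothetical, dropping the Unclassified row when -u is set is an assumption, and the "except viruses" exemption is not modelled because it needs taxonomy information.

```python
def apply_filters(counts, min_frac=0.0, min_count=0, exclude_unclassified=False):
    """Filter per-taxon read counts.

    counts maps taxon_id -> reads, with taxon_id 0 meaning unclassified.
    Returns a dict taxon_id -> (percent, reads).
    """
    total = sum(counts.values())
    if exclude_unclassified:
        # -u: unclassified reads do not count towards the denominator.
        total -= counts.get(0, 0)

    kept = {}
    for taxon_id, reads in counts.items():
        if taxon_id == 0 and exclude_unclassified:
            continue  # assumption: the Unclassified row is dropped with -u
        percent = 100.0 * reads / total if total else 0.0
        # -m and -c apply to classified taxa only in this sketch.
        if taxon_id != 0 and (percent < min_frac or reads < min_count):
            continue
        kept[taxon_id] = (percent, reads)
    return kept


# Toy counts, made up for illustration only.
toy = {0: 60, 1578: 30, 1301: 10}
print(apply_filters(toy, min_frac=20.0))
print(apply_filters(toy, exclude_unclassified=True))
```

In the first call taxon 1301 sits at 10% and is filtered out; in the second, percentages are computed over the 40 classified reads only.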
modified 6 weeks ago • written 7 weeks ago by Macspider

To be honest it is not clear what this tool does, and that is perhaps the most important requirement of any software.

Kraken already produces various outputs, and it is not clear how this tool differs from them or what exactly it does.

PS: You also say it depends on multiprocessing. Why is that? Your software does not seem to use multiprocessing.

modified 7 weeks ago • written 7 weeks ago by Istvan Albert

I edited the post to answer your questions. As for multiprocessing, yes, that was my mistake. My first version depended on the multiprocessing module, but switching to Dask allowed me to get rid of it.

written 6 weeks ago by Macspider

Thanks for editing the question.

Now that I understand it better, I will mention that Kraken2 can also write a report format (the --report option), which looks like this:

 0.20   122 122 U   0   unclassified
 99.80  60878   0   R   1   root
 99.80  60878   0   R1  131567    cellular organisms
 99.80  60878   0   D   2759        Eukaryota
 99.80  60878   0   D1  33154         Opisthokonta
 99.80  60878   0   K   33208           Metazoa
 99.80  60878   0   K1  6072              Eumetazoa
 99.80  60878   0   K2  33213               Bilateria
 99.80  60878   0   K3  33511                 Deuterostomia
 99.80  60878   0   P   7711                    Chordata
 99.80  60878   0   P1  89593                     Craniata
 99.80  60878   0   P2  7742                        Vertebrata
 98.20  59901   0   P3  7776                          Gnathostomata
 98.20  59901   0   P4  117570                          Teleostomi
 98.20  59901   0   P5  117571                            Euteleostomi
 98.20  59901   0   P6  7898                                Actinopterygii
 98.20  59901   108 C   186623                                Actinopteri
 96.40  58801   6   C1  41665                                   Neopterygii
 93.17  56833   0   C2  32443                                     Teleostei
 93.17  56833   0   C3  1489341                                     Osteoglossocephalai
 93.17  56833   50  C4  186625                                        Clupeocephala
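
For reference, the six report columns are: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxon ID, and the indented name. A minimal sketch of pulling one rank out of such a report could look like this; the class and function names are hypothetical, and keeping the unclassified row alongside the chosen rank is a choice made here for illustration.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads in the clade rooted at this taxon
    clade_reads: int    # reads in the clade
    direct_reads: int   # reads assigned directly to this taxon
    rank_code: str      # U, R, D, K, P, C, O, F, G, S (optionally with a digit)
    taxon_id: int
    name: str


def parse_report_line(line):
    # Split on any whitespace for the first five columns; the rest is the name.
    pct, clade, direct, rank, taxid, name = line.strip().split(None, 5)
    return ReportRow(float(pct), int(clade), int(direct), rank, int(taxid), name)


def rows_at_rank(lines, rank_code):
    """Keep the unclassified row plus every row with the requested rank code."""
    rows = [parse_report_line(line) for line in lines if line.strip()]
    return [row for row in rows if row.rank_code in ("U", rank_code)]


# Three lines copied from the report excerpt above.
demo_report = [
    " 0.20   122 122 U   0   unclassified",
    " 99.80  60878   0   K   33208           Metazoa",
    " 99.80  60878   0   P   7711                    Chordata",
]

for row in rows_at_rank(demo_report, "P"):
    print(row.percent, row.clade_reads, row.taxon_id, row.name, sep="\t")
```

Selecting the rows for a single rank code already gives a per-sample table of percentages, read counts, taxon IDs and names.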

In addition, the recommended workflow is to process the Kraken2 report with Bracken:

https://ccb.jhu.edu/software/bracken/index.shtml?t=manual

This creates column-oriented output with the following fields:

  • Name
  • Taxonomy ID
  • Level ID (S=Species, G=Genus, O=Order, F=Family, P=Phylum, K=Kingdom)
  • Kraken Assigned Reads
  • Added Reads with Abundance Reestimation
  • Total Reads after Abundance Reestimation
  • Fraction of Total Reads

In addition, the Bracken tool can concatenate several files, thus creating a tabular report across all samples.
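
If the goal is a table shaped like the kaiju2table output, a Bracken file can be reshaped with a few lines of Python. This is only a sketch: the function name is hypothetical, and the header names used below (name, taxonomy_id, new_est_reads, fraction_total_reads) are assumed to match the header row of the Bracken output, so check them against your own files.

```python
import csv
import io


def bracken_to_rows(file_name, handle):
    """Return rows of (file, percent, reads, taxon_id, taxon_name)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for rec in reader:
        rows.append((
            file_name,
            100.0 * float(rec["fraction_total_reads"]),  # fraction -> percent
            int(rec["new_est_reads"]),                   # reads after re-estimation
            int(rec["taxonomy_id"]),
            rec["name"],
        ))
    return rows


# Made-up numbers for illustration; only the taxon IDs and names are real.
demo = io.StringIO(
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Lactobacillus\t1578\tG\t70\t5\t75\t0.75\n"
    "Streptococcus\t1301\tG\t20\t5\t25\t0.25\n"
)

for row in bracken_to_rows("demo.bracken", demo):
    print(*row, sep="\t")
```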

written 6 weeks ago by Istvan Albert

Sure. But my workflow combines many tools, so I needed a consistent format, and I noticed that no tool similar to kaiju2table was available for Kraken2.

written 6 weeks ago by Macspider