Question

Help me with bacteria classification

23 months ago

Giulia.cosenza ▴ 100

Hi, I have a FASTQ file from untargeted metagenomic sequencing. I ran a taxonomic analysis on it with Kraken2, and my output looks like this:

[image: Kraken2 report output]

The fields of the output, from left-to-right, are as follows:

- Percentage of fragments covered by the clade rooted at this taxon

- Number of fragments covered by the clade rooted at this taxon

- Number of fragments assigned directly to this taxon

- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code formed from the rank code of the closest ancestor rank plus a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.

- NCBI taxonomic ID number

- Indented scientific name
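Given the fields above, the species-level rows can be pulled out of the report with awk. This is only a sketch: it assumes the report is the standard tab-separated, six-column Kraken2 report, and `kraken_species` is a made-up helper name, not part of Kraken2.

```shell
# Hypothetical helper: list "taxid<TAB>name" for every species-level row
# of a Kraken2 report (assumes the standard 6-column, tab-separated layout).
kraken_species() {
    awk -F '\t' '
        $4 == "S" {
            name = $6
            sub(/^ +/, "", name)   # strip the indentation of the name
            print $5 "\t" name
        }
    ' "$1"
}
```

Calling `kraken_species report.txt` (with `report.txt` standing in for your own report file) prints one TaxID and species name per line.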

I'd like to understand the nature of all the species found. In particular, I'd like to know their source of isolation, whether they are pathogens, and so on.

What is the best way to do that?

Someone suggested that I use Entrez Direct like this:

$ esearch -db biosample -query SAMN10026047 | efetch

1: Corallococcus genome_CA054A

Identifiers: BioSample: SAMN10026047; Sample name: Corallococcus CA054A

Organism: Corallococcus terminator

Attributes:

/strain="CA054A"

/isolation source="soil"

/collection date="2016-09-28"

/geographic location="United Kingdom"

/sample type="Bacterial Isolate"

/identified by="Aberystwyth University"

/type-material="type strain of Corallococcus terminator"

Accession: SAMN10026047 ID: 10026047

But I do not know the accession numbers for these species; I only have their names and their taxonomic IDs.

Bacteria sra NCBI • 557 views

updated 23 months ago by Istvan Albert • written 23 months ago by Giulia.cosenza

score 1 · Answer 1 · 2022-05-19

The default Kraken2 database is built from RefSeq data. NCBI's assembly summary table connects each TaxID to an assembly accession and a BioSample accession.

The structure of that table is described here:

https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt

The table covering all of RefSeq, bacteria included, is available as assembly_summary_refseq.txt here:

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

From that table you can work out how each TaxID connects to the other information, such as the BioSample accession that esearch needs.
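A minimal sketch of that lookup, assuming you have downloaded assembly_summary_refseq.txt locally. The column positions (1 = assembly_accession, 3 = biosample, 6 = taxid, 7 = species_taxid, 8 = organism_name) are taken from the README linked above, so check them against the header line of your copy. `taxid_to_biosample` is a made-up helper name.

```shell
# Hypothetical helper: for a given TaxID, print
# "assembly_accession<TAB>biosample<TAB>organism_name" for each matching row.
# Assumed columns: 1 assembly_accession, 3 biosample, 6 taxid,
# 7 species_taxid, 8 organism_name (verify against the file header).
taxid_to_biosample() {
    taxid="$1"
    table="$2"
    awk -F '\t' -v id="$taxid" '
        /^#/ { next }                                   # skip header lines
        $6 == id || $7 == id { print $1 "\t" $3 "\t" $8 }
    ' "$table"
}
```

Run it as `taxid_to_biosample <taxid> assembly_summary_refseq.txt`. Each BioSample accession in the second column can then be passed to the `esearch -db biosample -query ... | efetch` command from the question. A species often has many assemblies, so expect several rows per TaxID.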