I would like to build a tree for my MAGs/bins like the one in the image below. Could anyone please share the steps or scripts in detail?
When I run gtdbtk de_novo_wf on a set of bin files with --skip_gtdb_refs, I always end up with the following error.
[2023-11-20 12:55:35] INFO: Read custom taxonomy for 45 genomes.
[2023-11-20 12:55:35] INFO: Reassigned taxonomy for 45 GTDB representative genomes.
[2023-11-20 12:55:35] ERROR: GTDB-Tk classification and custom taxonomy files must not specify taxonomies for the same genomes.
[2023-11-20 12:55:35] ERROR: These files have 45 genomes in common.
[2023-11-20 12:55:35] ERROR: Example duplicate genome: bin.18
[2023-11-20 12:55:35] ERROR: Duplicated taxonomy information.
[2023-11-20 12:55:35] ERROR: Controlled exit resulting from an unrecoverable error or warning.
My script is below:
gtdbtk de_novo_wf \
    --genome_dir /zfs/camplab/Jojy/darpa_working/drep-output_directory/dereplicated_genomes \
    --out_dir /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/de_novo_new \
    --extension fa \
    --bacteria \
    --gtdbtk_classification_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/gtdbtk.bac120.summary.tsv \
    --cpus 40 \
    --outgroup_taxon p__Chloroflexota \
    --skip_gtdb_refs \
    --custom_taxonomy_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/CUSTOM_TAXONOMY_FILE
These genomes have already been analyzed with classify_wf, so I have their taxonomy. I therefore passed gtdbtk.bac120.summary.tsv as --gtdbtk_classification_file and built a custom taxonomy file from the same summary. Both files are attached.
My custom_taxonomy_file is below:
bin.1	d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__;f__;g__;s__
bin.10	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Planctomycetales;f__Planctomycetaceae;g__;s__
bin.11	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA4575;f__UBA4575;g__JABDMD01;s__
bin.13	d__Bacteria;p__Tectomicrobia;c__Entotheonellia;o__Entotheonellales;f__Entotheonellaceae;g__Entotheonella;s__
bin.14	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA4486;f__UBA4486;g__;s__
bin.15	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA6522;f__UBA6522;g__;s__
bin.16	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__GCA-001735895;f__GCA-001735895;g__GCA-001735895;s__
bin.17	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Woeseiales;f__Woeseiaceae;g__SZUA-117;s__
bin.18	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudohongiellaceae;g__UBA5109;s__
bin.19	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__;g__;s__
bin.2	d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Rhizobiaceae;g__JAALLB01;s__
bin.20	d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Verrucomicrobiales;f__DEV007;g__;s__
bin.22	d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methyloligellaceae;g__MnTg02;s__
bin.24	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA6186;f__UBA6186;g__;s__
bin.25	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__Pirellulaceae;g__Mariniblastus;s__
bin.26	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__Lacipirellulaceae;g__Bythopirellula;s__
bin.27	d__Bacteria;p__Actinobacteriota;c__Acidimicrobiia;o__Acidimicrobiales;f__UBA11606;g__;s__
bin.28	d__Bacteria;p__Planctomycetota;c__PLA2;o__PLA2;f__JAEUHO01;g__;s__
bin.29	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__Pirellulaceae;g__GCA-2726245;s__
bin.3	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__Pirellulaceae;g__GCA-2723275;s__
bin.30	d__Bacteria;p__Acidobacteriota;c__Vicinamibacteria;o__Bin61;f__SMYC01;g__;s__
bin.31	d__Bacteria;p__Acidobacteriota;c__Thermoanaerobaculia;o__UBA5704;f__UBA5704;g__;s__
bin.32	d__Bacteria;p__Planctomycetota;c__Planctomycetia;o__Pirellulales;f__Lacipirellulaceae;g__Bythopirellula;s__
bin.33	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__UBA6522;f__UBA6522;g__;s__
I used gtdbtk.bac120.summary.tsv as the --gtdbtk_classification_file; the table is attached below.
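The error message says the two files must not describe the same genomes, so the overlap can be checked before running de_novo_wf. Here is a minimal sketch, assuming the summary file is tab-separated with a user_genome column and the custom taxonomy file has the genome ID as its first field. The helper names (summary_ids, custom_ids, shared_genomes) are hypothetical, and the in-memory data is only a two-line stand-in for the real files.

```python
import csv
import io


def summary_ids(handle):
    """Genome IDs from a GTDB-Tk summary TSV (assumes a 'user_genome' column)."""
    return {row["user_genome"] for row in csv.DictReader(handle, delimiter="\t")}


def custom_ids(handle):
    """Genome IDs from a custom taxonomy file (first whitespace-delimited field)."""
    return {line.split(None, 1)[0] for line in handle if line.strip()}


def shared_genomes(summary_handle, custom_handle):
    """Sorted list of genome IDs that appear in both files."""
    return sorted(summary_ids(summary_handle) & custom_ids(custom_handle))


# Illustrative stand-ins; a real summary file has many more columns.
summary = io.StringIO(
    "user_genome\tclassification\n"
    "bin.18\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
    "o__Pseudomonadales;f__Pseudohongiellaceae;g__UBA5109;s__\n"
)
custom = io.StringIO(
    "bin.18\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
    "o__Pseudomonadales;f__Pseudohongiellaceae;g__UBA5109;s__\n"
    "bin.19\td__Bacteria;p__Planctomycetota;c__Planctomycetia;"
    "o__Pirellulales;f__;g__;s__\n"
)

overlap = shared_genomes(summary, custom)
print(len(overlap), "genomes in common:", overlap)
```

With the real files, the two io.StringIO objects would be replaced by open(...) handles on gtdbtk.bac120.summary.tsv and the custom taxonomy file. Since my custom file was built from the same summary, I would expect every genome to show up in the overlap, which matches the "45 genomes in common" line in the log.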
Please help me.
Thank you in advance.
** Image source: Bandla et al., 2020
Thank you for the quick response.
When I run it without --gtdbtk_classification_file, I get a different error.
[2023-11-21 10:14:08] ERROR: Uncontrolled exit resulting from an unexpected error.
================================================================================
EXCEPTION: TypeError
MESSAGE: Population must be a sequence. For dicts or sets, use sorted(d).
Traceback (most recent call last):
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/__main__.py", line 101, in main
    gt_parser.parse_options(args)
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 1051, in parse_options
    self.root(options)
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 776, in root
    reports = reroot.root_with_outgroup(options.input_tree,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/reroot_tree.py", line 83, in root_with_outgroup
    rnd_ingroup = random.sample(ingroup_leaves, 1)[0]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/random.py", line 439, in sample
    raise TypeError("Population must be a sequence. "
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).
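The traceback ends inside random.sample, which from Python 3.11 onward rejects sets and dicts instead of converting them. The snippet below is a small sketch of that Python behaviour only, not a patch to GTDB-Tk. It assumes ingroup_leaves is a set, which the error message suggests but which I have not verified in the GTDB-Tk source, and the leaf names are placeholders.

```python
import random

# Hypothetical stand-in for the ingroup leaves collected while rooting the tree.
ingroup_leaves = {"bin.1", "bin.10", "bin.18"}

# Passing the set directly: raises TypeError on Python 3.11+,
# and is only deprecated (still accepted) on 3.9 and 3.10.
try:
    random.sample(ingroup_leaves, 1)
    outcome = "set accepted (Python older than 3.11)"
except TypeError as err:
    outcome = f"TypeError: {err}"
print(outcome)

# Converting to a sequence first, as the message itself suggests, works on any version.
rnd_ingroup = random.sample(sorted(ingroup_leaves), 1)[0]
print(rnd_ingroup)
```

If this is indeed the cause, running GTDB-Tk in an environment with an older Python, or with a GTDB-Tk release that supports Python 3.11, may avoid the error. My environment is named py311 and uses python3.11, as the paths in the traceback show.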
I am really new to GTDB-Tk. Could you please help me with this? I have already updated the aligner too. I am working on an HPC cluster.
I don't have the capacity to troubleshoot every problem you may encounter with this program. GTDB-Tk has a GitHub site where you can describe the files you used, the command, and the error in greater detail. They should be able to help.