meaning of CheckM output
15 months ago
zhangdengwei ▴ 150

Hi all,

I used CheckM to check whether my bacterial isolates, each grown from a single colony, are contaminated. Here is the result:

  Bin Id                      Marker lineage           # genomes   # markers   # marker sets   0     1      2    3    4    5+   Completeness   Contamination   Strain heterogeneity
P14_4_chromosome         k__Bacteria (UID203)           5449        104            58        0     39     32   31   2    0       100.00          131.87             91.24
H13_5_chromosome         k__Bacteria (UID203)           5449        104            58        0     45     55   2    2    0       100.00          91.18              93.15
H18_4_chromosome   f__Enterobacteriaceae (UID5124)      134         1173          336        1    1169    3    0    0    0       99.97            0.15               0.00
H15_3_chromosome   f__Enterobacteriaceae (UID5124)      134         1173          336        1    1170    2    0    0    0       99.97            0.33               0.00
H13_3_chromosome   f__Enterobacteriaceae (UID5124)      134         1173          336        1    1168    4    0    0    0       99.97            0.09               0.00
H13_7_chromosome   f__Enterobacteriaceae (UID5162)       88         1207          328        2    1192    12   1    0    0       99.93            1.28              13.33
P15_2_chromosome   f__Enterobacteriaceae (UID5124)      134         1172          336        1    1169    2    0    0    0       99.90            0.33               0.00
H12_1_chromosome   f__Enterobacteriaceae (UID5124)      134         1173          336        2    1170    1    0    0    0       99.67            0.04               0.00
H14_5_chromosome   f__Enterobacteriaceae (UID5124)      134         1173          336        4    1168    1    0    0    0       99.37            0.04               0.00
I am a bit confused about the meaning of completeness and contamination. Taking P14_4 as an example, the completeness is 100 while the contamination is 131.87. What do they represent? Also, are the completeness and contamination calculated relative to the Marker lineage? Any advice would be greatly appreciated!

15 months ago
Asaf 8.6k

CheckM uses single-copy marker genes to evaluate the completeness and contamination of a genome (or a pseudo-genome). If all the markers are found in the genome, completeness is 100% (these genes are expected in essentially every genome of the lineage). If they appear more than once, the genome is probably contaminated, because each marker is expected to occur only once per genome. For P14_4 there are 104 markers: 39 appear once, 32 appear twice, 31 appear three times and 2 appear four times. Since all the markers are found, the genome is probably complete, but the extra copies suggest you are looking at roughly 2.3 genomes instead of one (100% completeness plus about 130% contamination). The strain heterogeneity of 91.24 suggests that most of this contamination comes from another strain of the same or a closely related bacterium.

So overall most of your genomes look very good; P14_4 is about 2.3 genomes and H13_5 is about two genomes from two strains.
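To make the arithmetic behind the copy-count columns concrete, here is a minimal Python sketch. It is a simplified, unweighted calculation, not CheckM's actual method: CheckM averages over collocated marker sets (58 sets for UID203), which is why it reports 131.87 for P14_4 while the naive per-marker figure below comes out lower. The function name `naive_checkm_stats` is hypothetical, and the "5+" column is counted as exactly 5 copies.

```python
def naive_checkm_stats(copy_counts):
    """Unweighted estimates from the 0, 1, 2, 3, 4, 5+ columns.

    copy_counts[k] is the number of markers found k times.
    Assumption: the last bin ("5+") is treated as exactly 5 copies.
    This ignores CheckM's marker-set weighting, so the values only
    approximate the Completeness and Contamination columns.
    """
    total = sum(copy_counts)                      # number of markers
    found = total - copy_counts[0]                # markers seen at least once
    # Each copy beyond the first counts as one unit of contamination.
    extra = sum((k - 1) * n for k, n in enumerate(copy_counts) if k > 1)
    copies = sum(k * n for k, n in enumerate(copy_counts))
    return {
        "completeness": 100.0 * found / total,
        "contamination": 100.0 * extra / total,
        "mean_copies": copies / total,
    }


# P14_4_chromosome row: columns 0, 1, 2, 3, 4, 5+
p14_4 = [0, 39, 32, 31, 2, 0]
stats = naive_checkm_stats(p14_4)
print(f"completeness  = {stats['completeness']:.2f}")   # 100.00
print(f"contamination = {stats['contamination']:.2f}")  # 96.15
print(f"mean copies   = {stats['mean_copies']:.2f}")    # 1.96
```

Either way you count it, the conclusion is the same: every marker is present, and on average each one appears about twice, so this assembly contains more than one genome.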

Asaf, may I ask one more question? What is the meaning of Marker lineage? Based on my understanding, the 2.3 genomes in P14_4 were placed in k__Bacteria but could not be assigned further down to f__Enterobacteriaceae, right?

Exactly. CheckM couldn't assign this genome to a lower taxonomic level, potentially because it was contaminated with a bacterium from another phylum.