Question

MGS analysis and CheckM for binning quality control

2.2 years ago

Jonathan Yoou ▴ 60

Hi all,

I'm currently practicing MGS analysis, and I'm confused about the overall process and about CheckM.

So, after shotgun sequencing, I got raw sequences in fastq format. In the files, there must be sequences of microbes as well as host or any other contaminants, right?

Then, I'm going to assemble them into contigs, and bin the assembled contigs using a tool such as MetaBAT2. Assembled contigs are sequences which are very similar to each other, and they could be genes, or just fragmented sequences from one species. I'm not sure whether I understand correctly so far, so please let me know if I got anything wrong.

Next, among those assembled contigs, contaminated or low-quality contigs will be filtered out by CheckM, right? CheckM requires a directory containing contigs, but what I got from the binning tool is just one fasta file containing the contigs and their sequences. Do I then have to separate all the contigs into individual files, put them into one directory, and run CheckM on them? Or am I misunderstanding something?

Thank you in advance for all your comments!

checkm contigs binning qa • 867 views

updated 2.2 years ago by Mensur Dlakic ★ 27k • written 2.2 years ago by Jonathan Yoou ▴ 60

score 1 · Answer 1 · 2022-02-07

After the assembly, MetaBAT2 will bin the contigs into groups. Binning is not done based on sequence similarity or presence/absence of genes, but rather based on statistical properties of the DNA sequences (tetranucleotide frequencies, combined with contig coverage when depth information is provided). This means that you start with a single fasta file (your assembly), and after MetaBAT2 you get multiple fasta files, each a subset of the assembly, named with the prefix you specified. If your assembly has not been divided into individual fasta files (bins), something went wrong in the binning step. It may help to specify a directory and a file prefix with metabat2. For example, -o metabat2/my_bins will create a directory metabat2 (if it doesn't exist already), and all the bins in it will have the my_bins prefix.
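To make "tetranucleotide frequencies" concrete, here is a minimal Python sketch that computes the relative frequency of every 4-mer in one contig. This is only an illustration of the statistic, not MetaBAT2's actual code: the function name is made up for this example, and real binners typically also merge reverse-complement 4-mers and combine the composition signal with coverage, neither of which is done here.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 possible 4-mers.

    Hypothetical helper for illustration only; not part of MetaBAT2.
    """
    seq = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=4)]

    # Count every overlapping window of length 4 along the contig.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))

    # Windows containing anything other than A/C/G/T (e.g. N) are ignored.
    total = sum(counts[k] for k in all_kmers)
    if total == 0:
        return {k: 0.0 for k in all_kmers}
    return {k: counts[k] / total for k in all_kmers}
```

Contigs from the same genome tend to have similar frequency profiles, which is why a binner can group them without comparing the sequences to each other directly.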

CheckM doesn't filter anything. It takes the directory of bin files created in the previous step and assesses the completeness and contamination of each bin; deciding which bins to keep is up to you. The output should look something like this:

  Bin Id                 Marker lineage             # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  group_077            k__Archaea (UID2)               207         145           103         0     17   78   50   0    0       100.00          114.25             67.98
  group_059         c__Thermoprotei (UID147)            54         217           168         0    216   1    0    0    0       100.00           0.60              100.00
  group_054           k__Bacteria (UID203)             5449        104            58         0     93   5    6    0    0       100.00           2.66               0.00
  group_033           k__Bacteria (UID203)             5449        104            58         0    104   0    0    0    0       100.00           0.00               0.00
  group_002           k__Bacteria (UID209)             5443        105            59         1    104   0    0    0    0       99.66            0.00               0.00
  group_024         p__Euryarchaeota (UID3)            148         187           124         1    185   1    0    0    0       99.19            0.81               0.00
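Since CheckM only reports these numbers, any filtering is a separate step. Below is a rough Python sketch that reads a table laid out like the one above and lists the bins passing a completeness/contamination cutoff. The function names are made up for this example, and the default cutoffs (at least 90% complete, at most 5% contaminated) are an assumption borrowed from commonly used "high-quality" criteria; adjust them to your own needs.

```python
def parse_checkm_table(text):
    """Map each bin id to (completeness, contamination).

    Assumes the plain-text layout shown above: the bin id is the first
    whitespace-separated field, and the last three fields are completeness,
    contamination and strain heterogeneity.
    """
    bins = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank lines and the dashed separator
        try:
            completeness = float(fields[-3])
            contamination = float(fields[-2])
        except ValueError:
            continue  # header line, where these fields are column names
        bins[fields[0]] = (completeness, contamination)
    return bins


def select_bins(bins, min_completeness=90.0, max_contamination=5.0):
    """Return ids of bins meeting both cutoffs (cutoffs are assumptions)."""
    return [
        bin_id
        for bin_id, (completeness, contamination) in bins.items()
        if completeness >= min_completeness and contamination <= max_contamination
    ]
```

With the table above, group_077 would be dropped (114.25% contamination suggests several genomes merged into one bin), while the other five bins would pass these cutoffs.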