MGS analysis and CheckM for binning quality control
2.2 years ago

Hi all,

I'm currently practicing MGS analysis, and I'm confused about the overall process and about CheckM.

After shotgun sequencing, I got raw sequences in fastq format. These files must contain sequences from the microbes as well as from the host and any other contaminants, right?

Then I'm going to assemble the reads into contigs and bin the assembled contigs using a tool such as MetaBAT2. As I understand it, assembled contigs are sequences that are very similar to each other, and they could be genes or just fragmented sequences from one species. I'm not sure whether I've understood this correctly so far, so please let me know if I've got anything wrong.

Next, among those assembled contigs, contaminated or low-quality contigs will be filtered out by CheckM, right? CheckM requires a directory containing contigs, but what I got from the binning tool is just one fasta file containing the contigs and their sequences. Do I have to split the contigs into separate files, put them in one directory, and run CheckM on them? Or have I misunderstood?

Thank you in advance for all your comments!

checkm contigs binning qa • 867 views
2.2 years ago
Mensur Dlakic ★ 27k

After the assembly, metabat2 will bin the contigs into groups. Binning is not based on sequence similarity or the presence/absence of genes, but on statistical properties of the DNA sequences (tetranucleotide frequencies, to be exact, combined with contig coverage when a depth file is supplied). This means that you start with a single fasta file (your assembly), and after metabat2 you get multiple fasta files, each a subset of the assembly, named after the prefix you specified. If your assembly has not been divided into individual fasta files (bins), something went wrong in the binning step. It may help to specify a directory and a file prefix with metabat2. For example, -o metabat2/my_bins will create a directory metabat2 (if it doesn't already exist), and all the bins in it will have the my_bins prefix.
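To make the idea of binning by tetranucleotide frequencies more concrete, here is a minimal Python sketch. It is not MetaBAT2's actual implementation (which, among other things, also treats reverse-complement k-mers together and models coverage); the toy sequences and the function names `tetranucleotide_frequencies` and `euclidean_distance` are made up for illustration. It only shows that two contigs with different sequences can still have similar 4-mer profiles, while a contig of different composition lies far away.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the unambiguous DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each of the 256 tetranucleotides."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made of A/C/G/T are counted; windows with N etc. are ignored.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean_distance(a, b):
    """Plain Euclidean distance between two frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical toy contigs, far shorter and simpler than real ones.
contig_a = "ATATATATGCATATATATTA" * 20  # AT-rich
contig_b = "TATATAATATGCATATATAT" * 20  # AT-rich, but a different sequence
contig_c = "GCGCGGCCGCGCCGGCGCGC" * 20  # GC-rich

tnf_a = tetranucleotide_frequencies(contig_a)
tnf_b = tetranucleotide_frequencies(contig_b)
tnf_c = tetranucleotide_frequencies(contig_c)

dist_ab = euclidean_distance(tnf_a, tnf_b)
dist_ac = euclidean_distance(tnf_a, tnf_c)
print(f"a vs b: {dist_ab:.3f}")  # small: similar composition
print(f"a vs c: {dist_ac:.3f}")  # larger: different composition
```

In this toy case contigs a and b would be candidates for the same bin even though they do not align to each other, which is the key difference from similarity-based clustering such as OTU picking.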

CheckM doesn't filter anything. It will take a directory of files created in the previous step and assess their completeness and contamination. That should look something like this:

  Bin Id                 Marker lineage             # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  group_077            k__Archaea (UID2)               207         145           103         0     17   78   50   0    0       100.00          114.25             67.98
  group_059         c__Thermoprotei (UID147)            54         217           168         0    216   1    0    0    0       100.00           0.60              100.00
  group_054           k__Bacteria (UID203)             5449        104            58         0     93   5    6    0    0       100.00           2.66               0.00
  group_033           k__Bacteria (UID203)             5449        104            58         0    104   0    0    0    0       100.00           0.00               0.00
  group_002           k__Bacteria (UID209)             5443        105            59         1    104   0    0    0    0       99.66            0.00               0.00
  group_024         p__Euryarchaeota (UID3)            148         187           124         1    185   1    0    0    0       99.19            0.81               0.00
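Since CheckM only reports quality and leaves the filtering to you, one way to act on a table like the one above is to parse it and keep the bins that meet your own thresholds. The sketch below is an illustration only: the rows are copied from the output above, the helper names are hypothetical, and the default cutoffs (at least 90% completeness, at most 5% contamination) are an assumed convention that you should adjust to your project.

```python
# Rows copied from the CheckM output above (header and separator omitted).
CHECKM_ROWS = """\
group_077  k__Archaea (UID2)         207   145  103  0  17   78  50  0  0  100.00  114.25  67.98
group_059  c__Thermoprotei (UID147)  54    217  168  0  216  1   0   0  0  100.00  0.60    100.00
group_054  k__Bacteria (UID203)      5449  104  58   0  93   5   6   0  0  100.00  2.66    0.00
group_033  k__Bacteria (UID203)      5449  104  58   0  104  0   0   0  0  100.00  0.00    0.00
group_002  k__Bacteria (UID209)      5443  105  59   1  104  0   0   0  0  99.66   0.00    0.00
group_024  p__Euryarchaeota (UID3)   148   187  124  1  185  1   0   0  0  99.19   0.81    0.00
"""


def parse_checkm_rows(text):
    """Map each bin id to its completeness, contamination and heterogeneity."""
    bins = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # The bin id is the first column; the three quality values are the last three.
        completeness, contamination, heterogeneity = map(float, fields[-3:])
        bins[fields[0]] = {
            "completeness": completeness,
            "contamination": contamination,
            "strain_heterogeneity": heterogeneity,
        }
    return bins


def passing_bins(bins, min_completeness=90.0, max_contamination=5.0):
    """Return the sorted ids of bins that satisfy both (assumed) thresholds."""
    return sorted(
        name
        for name, quality in bins.items()
        if quality["completeness"] >= min_completeness
        and quality["contamination"] <= max_contamination
    )


bins = parse_checkm_rows(CHECKM_ROWS)
kept = passing_bins(bins)
print(kept)
```

With these cutoffs, group_077 is the only bin dropped: it looks complete, but a contamination of 114.25 suggests that more than one genome ended up in that bin.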

Hi, thank you for your detailed reply. I think I misunderstood the binning process and MAGs because my experience is with 16S amplicon sequences and their OTUs, so I thought they were almost the same.

So, after assembling the raw sequences, the assembled contigs are binned (or clustered) by the statistical similarity of their DNA sequences (for example, contigs with similar abundance or GC content are assumed to come from the same species), not by sequence similarity itself. And, of course, the sequences clustered into the same bin are all different from each other.

This is why I'm going to get several bin files (several species, each containing a bunch of contigs) from one MGS raw sequence dataset, and CheckM assesses their completeness and contamination, much as fastqc/cutadapt/trimmomatic do in other genomic analyses.

Do I understand right?


Bins are based on tetranucleotide frequencies: sequences with similar 4n frequencies will end up going into the same bin.

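Bin completeness and contamination are assessed based on the presence/absence of certain single-copy marker genes, which is different from the read statistics provided by fastqc. The number of markers depends on the lineage CheckM assigns to the bin, as the "# markers" column in the output above shows (the fixed sets of 122 archaeal and 120 bacterial markers are the ones used by GTDB-Tk).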
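As a rough illustration of how marker-gene counts translate into completeness and contamination, here is a deliberately simplified Python sketch. It is not CheckM's formula: CheckM groups markers into collocated marker sets and averages over those sets, so its numbers differ from this naive per-marker estimate. The function name `naive_quality` is hypothetical, and the inputs are the "# markers" and 0/1/2/3/4/5+ columns from the output above.

```python
def naive_quality(n_markers, copy_counts):
    """Naive per-marker estimate of completeness and contamination (in %).

    copy_counts[i] is the number of marker genes found i times in the bin;
    the last entry stands for "5 or more" and is counted as exactly 5 copies.
    """
    missing = copy_counts[0]
    completeness = 100.0 * (n_markers - missing) / n_markers
    # Every copy beyond the first is treated as evidence of contamination.
    extra_copies = sum(
        (copies - 1) * n for copies, n in enumerate(copy_counts) if copies > 1
    )
    contamination = 100.0 * extra_copies / n_markers
    return completeness, contamination


# group_033: all 104 markers present exactly once.
print(naive_quality(104, [0, 104, 0, 0, 0, 0]))

# group_002: 1 of 105 markers missing; CheckM itself reports 99.66 here
# because it weights by marker sets rather than by individual markers.
print(naive_quality(105, [1, 104, 0, 0, 0, 0]))

# group_077: most markers present two or three times, hence the very
# high contamination in the CheckM output.
print(naive_quality(145, [0, 17, 78, 50, 0, 0]))
```

The point is that a missing marker lowers completeness and a duplicated marker raises contamination, which is why contamination can exceed 100% when a bin contains more than one genome.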
