Hi all,
I'm currently practicing MGS analysis, and I'm confused about the overall process and about CheckM.
So, after shotgun sequencing, I got raw sequences in fastq format. These files must contain sequences from microbes as well as from the host or any other contaminants, right?
Then I'm going to assemble them into contigs and bin them into assembled contigs using a tool such as MetaBAT2. Assembled contigs are sequences that are very similar to each other, and they could be genes or just fragmented sequences from one species. I'm not sure whether I've understood correctly so far, so please let me know if I got anything wrong.
Next, among those assembled contigs, contaminated or low-quality contigs will be filtered out by CheckM, right? CheckM requires a directory containing contigs, but what I got from the binning tool is just one fasta file containing contigs and their sequences. Do I then have to split all the contigs into separate files, put them in one directory, and run CheckM on them? Or have I misunderstood?
Thank you in advance for all your comments!
Hi, thank you for your detailed reply. I think I misunderstood the binning process and MAGs. My experience is with 16S amplicon sequences and their OTUs, so I thought they were almost the same.
So, after assembling the raw sequences, the assembled contigs are binned (or clustered) by statistical similarity of their DNA sequences (for example, they have similar abundance or GC content, so they should be from the same species), not by sequence similarity itself. And, of course, the sequences clustered into the same bin would all be different.
This is why I'm going to get several bin files (several species, each containing a bunch of contigs) from one set of MGS raw sequences, and CheckM assesses their completeness and contamination, much as fastqc/cutadapt/trimmomatic do in other genomic analyses.
Do I understand right?
Bins are based on tetranucleotide frequencies: sequences with similar 4-mer frequencies end up in the same bin. (MetaBAT2 also uses coverage depth when it is provided.)
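To make the tetranucleotide idea concrete, here is a minimal Python sketch. It is not how MetaBAT2 is implemented: real binners typically merge reverse-complement 4-mers and use more careful distance models. The toy contigs and the helper names (tetranucleotide_freqs, euclidean) are made up for illustration only.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_freqs(seq):
    """Return the 256 relative 4-mer frequencies of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only count 4-mers made of A/C/G/T (skips windows containing N, etc.).
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean(a, b):
    """Euclidean distance between two frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy contigs (hypothetical): a and b share the same composition but are
# different sequences; c has a very different (GC-rich) composition.
unit = "ATTATAATTTAATATTAA"
contig_a = unit * 30
contig_b = (unit[5:] + unit[:5]) * 30
contig_c = "GCCGGCGCGGCCGCGGCC" * 30

fa = tetranucleotide_freqs(contig_a)
fb = tetranucleotide_freqs(contig_b)
fc = tetranucleotide_freqs(contig_c)

d_ab = euclidean(fa, fb)
d_ac = euclidean(fa, fc)
print("distance a-b:", d_ab)
print("distance a-c:", d_ac)
```

Contigs a and b come out much closer to each other than a and c, so a composition-based binner would tend to put a and b in the same bin even though their sequences do not align to each other.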
Bin completeness and contamination are assessed based on presence/absence of certain marker genes (122 for archaea, 120 for bacteria), which is different from read statistics provided by fastqc.
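As a rough illustration of the marker-gene idea, the sketch below estimates completeness as the share of expected single-copy markers found at least once in a bin, and contamination as the share of extra copies. This is a simplification and an assumption on my part, not CheckM's actual code (CheckM groups markers into collocated sets and picks lineage-specific ones); the marker names and counts are hypothetical.

```python
def completeness_contamination(marker_counts, expected_markers):
    """Simplified estimate, in percent, from single-copy marker gene counts.

    marker_counts: dict mapping marker name -> copies found in the bin.
    expected_markers: list of marker names expected exactly once per genome.
    """
    n = len(expected_markers)
    # A marker counts toward completeness if it is found at least once.
    present = sum(1 for m in expected_markers if marker_counts.get(m, 0) >= 1)
    # Every copy beyond the first suggests sequence from another genome.
    extra = sum(max(marker_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return 100.0 * present / n, 100.0 * extra / n


# Hypothetical bin: 10 expected markers, 8 found, one of them twice.
expected = ["m%d" % i for i in range(1, 11)]
found = {"m1": 1, "m2": 1, "m3": 2, "m4": 1, "m5": 1,
         "m6": 1, "m7": 1, "m8": 1}

result = completeness_contamination(found, expected)
print(result)  # -> (80.0, 10.0)
```

So a bin can be judged without looking at read quality at all: what matters is which expected genes are in the bin's contigs and how many copies of each there are.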