This may be a silly question, but I would like to know whether metagenome assembly and binning is a valid way to determine whether a sample contains a mixture of species. Similarly, can metagenomics be used to identify and remove contamination from a single genome?
For some background, I was recently asked to assemble bacterial genomes from five samples sequenced on Illumina MiSeq. I was told that these samples were pure, so I ran the raw reads through a quick de novo draft genome assembly pipeline: BBDuk for adapter removal and quality filtering > SPAdes in “isolate” mode > Quast and lineage-specific CheckM. It seemed that the draft genomes were of okay quality.
However, I was a bit concerned by the level of contamination. I was later told that the five samples were actually enrichments of a specific genus of bacteria from an environmental sample, so there was no guarantee that each sample contained only a single species. I also learned that additional Illumina sequencing had been conducted on other passages from the original five samples, and that in some of these runs less than 50% of the reads mapped to the draft genomes I had generated.
This all led me to suspect that the original samples could contain a mixture of species, but I didn’t have access to reference genomes and therefore couldn’t use something like BBSplit to separate the raw reads by species. I thought that if the samples were mixed, then metagenome assembly and binning could generate a MAG for each species in the sample. So I ran the following pipeline on the reads from the original five samples: BBDuk with the same parameters as before > MetaSPAdes > MetaQuast > MetaBAT2, MaxBin2, and CONCOCT for binning > DAS Tool for optimizing bins > Quast and lineage-specific CheckM.

For each sample, the MetaQuast BLASTn search identified two references: a species belonging to the enriched genus and a genome of Streptococcus pneumoniae. However, none of the contigs in any sample aligned to S. pneumoniae. DAS Tool identified only a single bin for each sample, and these bins had >98% ANI to the corresponding original draft genomes.

Quast indicated that the bins generally had fewer contigs than the corresponding original draft genomes, and I assume that this explains most of the differences in total length, GC content, etc. CheckM found that, for most of the samples, the bins had slightly lower completeness and slightly lower contamination than the original drafts.
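One reference-free sanity check that could complement the binning is to look for contigs whose coverage or GC content differs sharply from the rest of the assembly. Below is a minimal Python sketch of that idea; it is not part of the pipeline described above. It assumes SPAdes-style contig headers (NODE_<n>_length_<len>_cov_<cov>), and the function names, the 1 kb length cutoff, and the 3-fold coverage / 0.10 GC thresholds are arbitrary illustrative choices rather than established defaults.

```python
import re
import statistics
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# Assumption: contig names follow the SPAdes convention,
# e.g. ">NODE_12_length_48213_cov_37.512".
HEADER_RE = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


class Contig(NamedTuple):
    name: str
    length: int
    coverage: float  # k-mer coverage taken from the header
    gc: float        # GC fraction, 0.0 to 1.0


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_table(lines: Iterable[str], min_length: int = 1000) -> List[Contig]:
    """Collect length, coverage and GC for contigs of at least min_length bp."""
    table = []
    for name, seq in read_fasta(lines):
        match = HEADER_RE.search(name)
        if match is None or len(seq) < min_length:
            continue  # skip short contigs and non-SPAdes headers
        seq = seq.upper()
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        table.append(Contig(name, len(seq), float(match.group(2)), gc))
    return table


def flag_outliers(table: List[Contig], cov_fold: float = 3.0,
                  gc_delta: float = 0.10) -> List[Contig]:
    """Return contigs far from the median coverage or median GC fraction."""
    if not table:
        return []
    med_cov = statistics.median(c.coverage for c in table)
    med_gc = statistics.median(c.gc for c in table)
    return [
        c for c in table
        if c.coverage > med_cov * cov_fold
        or c.coverage < med_cov / cov_fold
        or abs(c.gc - med_gc) > gc_delta
    ]
```

A typical use would be to open an assembly's contigs FASTA, pass the file handle to contig_table(), and inspect what flag_outliers() returns. Flagged contigs are only candidates for a second look: plasmids, repeats, and horizontally acquired regions can also differ in coverage or GC without being contamination.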
Based on the MetaQuast and DAS Tool output, I think that, luckily, the original enrichments were mostly pure. This probably means that, in the passages where few reads mapped to the draft genomes, other species had grown to higher proportions.
So, is using a metagenomics pipeline a valid means of determining whether the original samples contained a mixture of species? Again, this assumes that something like BBSplit is not an option because no reference genomes are available. If so, can I be reasonably confident from my analysis that the original samples contain primarily a single species? Additionally, do metagenomic assembly and binning work for removing contaminant contigs? If so, should I consider the bins/MAGs I generated to be “better” than the original draft genomes because they tend to have fewer short contigs and less contamination, even though they also tend to have lower completeness?
Thanks in advance!