Using metagenome assembly and binning to identify and mitigate contamination in a genome
4 months ago
btc347 • 0

Hi everyone,

This may be a silly question, but I'm wondering whether metagenome assembly and binning is a valid way to determine whether a sample contains a mixture of species. Similarly, can metagenomics be used to identify and remove contamination from a single genome?

For some background, I was recently asked to assemble bacterial genomes from five samples sequenced on Illumina MiSeq. I was told that these samples were pure, so I ran the raw reads through a quick de novo draft genome assembly pipeline: BBDuk for adapter removal and quality filtering > SPAdes in “isolate” mode > Quast and lineage-specific CheckM. It seemed that the draft genomes were of okay quality.
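As a side note on the "okay quality" judgment: the headline contiguity number Quast reports, N50, is easy to sanity-check by hand. Here is a minimal stdlib-Python sketch of the calculation (the contig lengths are invented for illustration):

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical contig lengths for a small draft assembly (~1.01 Mb total)
contigs = [500_000, 300_000, 150_000, 50_000, 10_000]
print(n50(contigs))  # 300000: the two largest contigs cover >= half the total
```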

[Image: draft genome Quast and CheckM results]

However, I was a bit concerned by the level of contamination. I was later told that the five samples were actually enrichments of a specific genus of bacteria from an environmental sample; there was therefore no guarantee that each sample contained only a single species. I also learned that additional Illumina sequencing had been conducted on other passages from the original five samples, and in some of these runs fewer than 50% of the reads mapped to the draft genomes I had generated.

This all led me to suspect that the original samples could contain a mixture of species, but I didn’t have access to reference genomes and therefore couldn’t use something like BBSplit to parse the raw reads. I thought that if the samples were mixed species then metagenome assembly and binning could generate MAGs for each species in the sample. So I ran the following pipeline on the reads from the original 5 samples: BBDuk with the same parameters as before > MetaSPAdes > MetaQuast > MetaBAT2, MaxBin2, and CONCOCT for binning > DAS Tool for optimizing bins > Quast and lineage-specific CheckM. For each sample, MetaQuast BLASTn identified two references: a species belonging to the enriched genus and a genome of Streptococcus pneumoniae. However, none of the contigs in any sample aligned to S. pneumoniae. DAS Tool identified only a single bin for each sample, and these bins had >98% ANI to the corresponding original draft genomes. Quast indicated that the bins generally had fewer contigs than the corresponding original draft genomes, and I assume that this explains the majority of the differences in total length, GC content, etc. CheckM found that for most of the samples, the bins had slightly lower completeness and slightly lower contamination.

[Image: metagenome bin Quast and CheckM results]

Based on the MetaQuast and DAS Tool output, I think that luckily the original enrichments were mostly pure. This probably means that, in the passages with low mapping to the draft genomes, the other species had grown to higher proportions.

So is using a metagenomics pipeline a valid means of determining if the original samples contained a mixture of species? This is again assuming that using something like BBSplit is not possible due to lack of reference genomes. If so, can I be reasonably confident from my analysis that the original samples contain primarily a single species? Additionally, does metagenomic assembly and binning work to remove contaminant contigs? If so, should I consider the bins/MAGs I generated to be “better” than the original draft genomes because they tend to have fewer short contigs and less contamination (though the bins/MAGs tend to have lower completeness)?

Thanks in advance!

Tags: genomes, contamination, metagenomics

It's not a silly question! And yes, it can be done. JGI is currently testing various binning tools to try to find the best protocol for this kind of decontamination. We see a lot of situations where a fungus is mixed with some associated bacteria, or an "enrichment" ends up containing two or three species. But I don't have any recommendations yet.

I generally use SendSketch to determine if a library appears to contain multiple species since it's fast and can run before any other processing. It also gives completeness and contamination estimates, though their accuracy depends on how closely related the sample is to something already in the reference set (RefSeq).
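To make the idea concrete (this is a toy illustration, not SendSketch's actual implementation), sketch-based comparison keeps only the smallest k-mer hash values of each sequence, so two genomes can be compared without aligning anything:

```python
import hashlib

def kmers(seq, k=8):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=8, size=16):
    """Keep the `size` smallest hash values of the k-mer set
    (a 'bottom-k' MinHash sketch)."""
    hashes = sorted(int(hashlib.sha1(km.encode()).hexdigest(), 16)
                    for km in kmers(seq, k))
    return set(hashes[:size])

def similarity(a, b):
    """Jaccard similarity of the two sketches: shared / total hashes."""
    sa, sb = sketch(a), sketch(b)
    return len(sa & sb) / len(sa | sb)

genome = "ACGTACGGTTCAGGCTAACGTTAGCATCGATCGGTACCTAG"
print(round(similarity(genome, genome), 2))  # 1.0: identical sketches
```

Real sketches use millions of k-mers and canonical (strand-independent) hashing, but the principle is the same, which is why it is fast enough to run before any other processing.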

One of the other things we do at JGI to identify likely contamination in the assembly is a gc/coverage plot; basically, map the reads to the contigs and plot a 2D scatter plot with contig GC on one axis and average coverage on the other. We also color the dots by taxa of best BLAST (or Sketch) hit of the contig. So most of the time, when there is contamination (or an organelle) you will see two little clouds of dots at different GC and coverage, with many of the dots having the same color. Not only can this be used to identify contaminated assemblies, but in many cases it can separate the organisms - particularly when you have low-level contamination and can simply use a coverage cutoff, or the organisms have very different GC. For more complex cases a dedicated binning tool could be better.
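The two quantities behind that plot are cheap to compute. A minimal stdlib sketch, using invented contig sequences and coverages (a real script would parse FASTA and per-contig mapping depth):

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a contig."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

# Hypothetical contigs with average read coverage taken from mapping
contigs = {
    "contig_1": ("ATGCGCGCATGCGCGGCCTA", 150.0),  # high GC, high coverage
    "contig_2": ("ATATTATAATATTAGCATAT", 8.0),    # low GC, low coverage
}

for name, (seq, cov) in contigs.items():
    print(f"{name}\tGC={gc_fraction(seq):.2f}\tcov={cov:.1f}")
# Points that form separate clouds on a (GC, coverage) scatter plot
# suggest a second organism; matplotlib's plt.scatter can draw it.
```

With well-separated clouds, even a plain coverage or GC cutoff on these two columns is enough to split the contigs into bins.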

4 months ago
Mensur Dlakic ★ 27k

It is a valid question, and I particularly like it when posters err on the side of providing more rather than less detail. Metagenomic binning can be used to weed out contamination, assuming the contaminant is not extremely similar to one of the organisms of interest. t-SNE can be used to separate the contaminants from the actual genome of interest.
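For context: t-SNE and most binners do not work on raw sequences but on per-contig tetranucleotide frequency vectors, which tend to differ between species. A minimal sketch of building such a vector (the example sequence is made up):

```python
from itertools import product

BASES = "ACGT"
ALL_TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]  # 256 4-mers

def tetra_freqs(seq):
    """Normalized tetranucleotide frequency vector for one contig."""
    counts = {t: 0 for t in ALL_TETRAMERS}
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in ALL_TETRAMERS]

vec = tetra_freqs("ACGTACGTACGTACGTACGT")
print(len(vec), round(sum(vec), 6))  # 256 1.0
```

One such 256-dimensional vector per contig is what you would feed to something like sklearn.manifold.TSNE to project contigs into 2D, where contaminant contigs typically fall into a separate cluster.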

I wouldn't worry about contamination levels <10%, except maybe for genome 3. Some of it can be due to the large number of sequencing errors that come with very deep sequencing, and the genome fragmentation (>50 contigs for all genomes except #3) hints at very deep sequencing. So, what is the average coverage depth for each of these genomes?

If the depth of sequencing is >500x, and especially if >1000x, sequencing errors will become significant enough that even a single genome will assemble such that what should be identical fragments end up with 97-99% identity (accumulated errors). That will give the appearance of multiple related strains and consequently of contamination. If you want to convince yourself of this, take a single genome and simulate Illumina reads from it at 0.5-1% error and 2000x coverage. You will see that upon assembly there will be a low-level contamination similar to numbers you have.
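A real simulation would use a dedicated tool such as ART or BBMap's randomreads.sh, but the idea can be sketched with a crude substitution-only error model in a few lines of stdlib Python (all parameters here are illustrative, and the "genome" is random):

```python
import random

def simulate_reads(genome, read_len=100, coverage=20, error_rate=0.01, seed=0):
    """Draw reads uniformly from `genome` and flip each base with
    probability `error_rate` (substitution errors only)."""
    rng = random.Random(seed)
    n_reads = (len(genome) * coverage) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
        reads.append("".join(read))
    return reads

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = simulate_reads(genome, coverage=20)  # scaled down from 2000x for the demo
print(len(reads), len(reads[0]))  # 1000 100
```

At 2000x with a 0.5-1% error rate, every position of the genome is hit by enough erroneous reads that some error k-mers reach depths the assembler can no longer dismiss as noise, which is how the "phantom strain" contigs arise.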


In most cases error-correction should take care of error-spawned fake minor alleles, though...

> If you want to convince yourself of this, take a single genome and simulate Illumina reads from it at 0.5-1% error and 2000x coverage. You will see that upon assembly there will be a low-level contamination similar to numbers you have.

This, in particular, sounds unlikely unless you skip error-correction. Normalization can also resolve it.

Now personally, I don't run CheckM so I'm not sure what's normal there, but the Quast reports indicate a typical bacterial genome size and decent contiguity: the contig lengths do not drop off sharply with length, as you might expect with two bacteria at highly differential coverage. So I don't see any indication that @btc347 has a particular contamination issue, but I'd be hesitant to chalk CheckM's contamination numbers up to spurious contigs generated from sequencing errors; I've never seen that happen. SPAdes does sometimes generate "spurious" (questionable, since they do in fact exist) contigs when assembling rapidly mutating things like viruses, but with proper preprocessing it does not generally bloat assembly sizes by up to 12% due to sequencing error. However, some bacteria DO have multiple copies of "single-copy genes"... or the copies might just be similar genes.


All good points, especially about multiple copies of single-copy genes. I am doing error-correction in my assemblies, but was making an educated guess that the OP didn't do it based on a relatively large number of contigs. SPAdes does internal error-correction, but in my hands that is often not enough. I also know from first-hand experience that simulations of deep sequencing - done on 3 genomes in my case - give contamination profiles similar to what is described here. Again, I wouldn't be worried with these levels of contamination.


Thank you both very much for your input! Below is some additional information related to the points you both brought up.

  • Regarding error correction, I was relying solely on the internal error correction of SPAdes. This has worked for me in the past, though I've only worked with sequencing from pure cultures previously. I'll get some error correction integrated into my pipeline going forward.
  • The sequencing depth for all 5 draft genomes was <200x.
  • I know this genus has some degree of gene duplication, so that supports the idea that some of the contamination calculated by CheckM is due to multiple copies of traditionally "single-copy" genes.

The group I'm assisting is currently trying to isolate single colonies from their original environmental samples. The plan is to sequence these isolates with Illumina and Nanopore for hybrid assembly, so that should help. Thank you both again!

