I am analysing three marine metagenomic samples. I am interested in retrieving Flavobacteria genomes, so I first co-assembled the three metagenomes using MEGAHIT and then mapped the reads from each metagenome to the co-assembly using Bowtie2. These data were fed to MetaBAT to calculate the differential abundance of each contig in each sample and to bin the contigs accordingly. As this is a complex community, I used the superspecific mode of MetaBAT to reduce bin contamination as much as possible. I got around 300 bins, which I inspected by looking at the probable origin of the contigs in them. I've got several good candidates for partial Flavobacterial genomes. One in particular caught my eye, since according to CheckM it was very complete (96%) and almost uncontaminated (1.5%). When checking the contigs within it, I realized that some had abundances across the three samples that were discordant with the common pattern of the bin. When I plotted these abundances, they looked like this:
I am very surprised by this, since, as you can see, these outlier contigs clearly do not belong to the bin given how different their abundances are. Notice that while most contigs in the bin are almost absent from samples 1 and 2, the outliers are very abundant in both. Indeed, when I manually removed these contigs from the bin, completeness was unchanged and almost all of the contamination disappeared (that is, I did not lose any "good" contigs and eliminated the alien ones).
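To make the manual check above concrete, here is a minimal sketch of how such outliers could be flagged automatically. It is not what MetaBAT does internally. It assumes a hypothetical dictionary mapping each contig in the bin to its mean depth in each sample, and the function name, the cutoff of 3.5, and the minimum scale of 0.1 are illustrative choices rather than established defaults. The depth values in the example are made up and are not my real data.

```python
import math
from statistics import median


def flag_outlier_contigs(depths, z_cutoff=3.5, pseudocount=1.0, min_scale=0.1):
    """Return contigs whose depth in any sample deviates from the bin's median.

    depths: dict of contig name -> list of mean depths, one per sample
            (hypothetical input layout, assumed for this sketch).
    A robust z-score (median and MAD on log depths) is computed per sample.
    """
    names = list(depths)
    n_samples = len(depths[names[0]])
    # Log-transform so that fold differences, not absolute ones, drive the score.
    logs = {n: [math.log(d + pseudocount) for d in depths[n]] for n in names}

    outliers = set()
    for s in range(n_samples):
        column = [logs[n][s] for n in names]
        med = median(column)
        mad = median(abs(v - med) for v in column)
        # 1.4826 rescales the MAD to a standard deviation under normality;
        # min_scale avoids dividing by a near-zero spread.
        scale = max(1.4826 * mad, min_scale)
        for n in names:
            if abs(logs[n][s] - med) / scale > z_cutoff:
                outliers.add(n)
    return sorted(outliers)


# Made-up depths: most contigs are nearly absent from samples 1 and 2.
example_bin = {
    "c1": [1.0, 1.5, 40.0],
    "c2": [0.8, 1.2, 42.0],
    "c3": [1.2, 1.6, 38.0],
    "c4": [0.9, 1.4, 41.0],
    "c5": [1.1, 1.3, 39.0],
    "c6": [1.0, 1.7, 43.0],
    "out1": [30.0, 25.0, 38.0],
    "out2": [28.0, 27.0, 41.0],
}
flagged = flag_outlier_contigs(example_bin)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation so that the outliers themselves do not distort the reference profile of the bin.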
My question is: how is it that MetaBAT put these contigs in the bin despite their very discordant abundances? Surely it has to do with the tetranucleotide profile of the contigs, but is it really giving that so much weight? Is there any way to avoid this kind of situation?