What are unbinned contigs?
5 months ago

Hello community. Looking at the number of binned contigs after a bin refinement procedure, I noticed that the contigs that ended up in bins (MAGs) are fewer than half of the contigs produced by the assembly.

For example, MEGAHIT produced 1,158,798 contigs. When I concatenate the contigs from all of the bins produced, I get 286,102 contigs. So what happens to the rest of the contigs?

One explanation could be the number of conserved loci present in the contigs. There could also be viral or eukaryotic sequences that are not binned, but I am not sure to what extent their proportion exceeds that of the conserved prokaryotic sequences. So what exactly are these unbinned contigs? Would it be premature to say that most of these sequences come from eukaryotic species?

binning MAGs metagenomics • 672 views
5 months ago

There are lots of ways to do binning, so unbinned contigs could be anything. If the binning is reference-intensive (e.g. BLASTing to known organisms and combining contigs with similar hits), then these could be contigs with no closely related assemblies. If the binning uses k-mer frequencies (or various other approaches), they could be too short to give a good representation of the organism's signature, so they get left out of the correct bin. If it's machine learning, perhaps nothing close to the organism was represented in the training set. If it's based on assembly graph traversal, anything very low-depth will likely not be contiguous enough to form a connected graph.

But generally they are just short, which could be due to low depth or to a highly polymorphic region in that community (which could be highly conserved but present in many weakly related organisms, or not highly conserved but present in a high-abundance species). There's no reason why euks can't be binned. The unbinned contigs can be binned too, if you want. Binning tools try to minimize both false-positive and false-negative calls, but false positives (which affect purity) matter more, so default settings will typically leave low-confidence contigs unbinned even when there is a bin that they best match.
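To check whether the unbinned contigs in a given dataset really are mostly short, one option is to split the assembly by bin membership and compare length summaries. The following is only a minimal sketch: the function names are hypothetical, the FASTA content is toy data, and it assumes the binner kept the assembler's contig IDs unchanged (some tools rename sequences, in which case the ID matching would need adapting).

```python
from statistics import median

def read_fasta_lengths(lines):
    """Return {contig_id: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the header up to the first whitespace as the contig ID.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def split_binned_unbinned(assembly_lengths, binned_ids):
    """Partition assembly contigs by whether their ID appears in any bin."""
    binned = {c: n for c, n in assembly_lengths.items() if c in binned_ids}
    unbinned = {c: n for c, n in assembly_lengths.items() if c not in binned_ids}
    return binned, unbinned

def summarize(lengths):
    """Count, total bases, and median length for a {contig_id: length} dict."""
    values = sorted(lengths.values())
    if not values:
        return {"n": 0, "total_bp": 0, "median_bp": 0}
    return {"n": len(values), "total_bp": sum(values), "median_bp": median(values)}

# Toy data standing in for the assembly and the concatenated bins.
assembly = """>k141_1 flag=1
ACGTACGTAC
>k141_2
ACGT
>k141_3
ACGTAC
""".splitlines()
bins = """>k141_1
ACGTACGTAC
""".splitlines()

assembly_lengths = read_fasta_lengths(assembly)
binned_ids = set(read_fasta_lengths(bins))
binned, unbinned = split_binned_unbinned(assembly_lengths, binned_ids)
print("binned:  ", summarize(binned))
print("unbinned:", summarize(unbinned))
```

With real files, the same functions can be fed open file handles instead of the toy strings. Comparing the two summaries shows whether the unbinned set is dominated by short contigs in terms of count, total bases, or both.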


Thanks so much for this detailed answer. It gave me material to think about, especially regarding the GTDB-Tk classification, in the sense of 'to what extent can all the bins be classified as bacteria and archaea and none of them as unclassified (given that the db only contains genomes from those kingdoms)?'. These are Antarctic soil metagenomes (putatively with relatively high alpha diversity), so I hoped to rescue at least some euks.

5 months ago
Mensur Dlakic ★ 27k

Generally speaking, unbinned contigs come from default cutoffs in binning programs or from the clustering procedure. There is no point in binning small contigs, as there isn't enough signal in them. If you consider a 500 bp fragment, there are 497 overlapping tetranucleotides (4Ns) in it. Given a total of 256 possible 4Ns (or 136 if you count complementary 4Ns together), there would be on average fewer than 2 counts for any one of the 256 4Ns. That's not enough to distinguish different groups. Most binning programs apply an automatic length cutoff if you don't specify one, and throw away the contigs shorter than that size. I'm pretty sure that MetaBAT2 has it at 1.5 Kb, and I rarely deal with anything smaller than 2 Kb. That will eliminate a large number of contigs.
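The arithmetic above can be reproduced with a short script. This is just an illustrative sketch with hypothetical function names: it counts the overlapping windows in a contig of a given length, enumerates the canonical tetranucleotides (a 4N and its reverse complement counted as one), and prints the average number of observations per 4N for a few contig lengths.

```python
from itertools import product

def revcomp(kmer):
    """Reverse complement of a DNA k-mer (uppercase A/C/G/T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(kmer))

def n_windows(length, k=4):
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(length - k + 1, 0)

def n_canonical_kmers(k=4):
    """Number of distinct k-mers when a k-mer and its reverse complement are merged."""
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return len({min(kmer, revcomp(kmer)) for kmer in all_kmers})

total_4n = 4 ** 4                  # 256 possible tetranucleotides
canonical_4n = n_canonical_kmers() # strand-merged tetranucleotides

for length in (500, 1500, 2000, 10000):
    windows = n_windows(length)
    print(
        f"{length:>6} bp: {windows:>5} windows, "
        f"{windows / total_4n:.2f} per 4N, "
        f"{windows / canonical_4n:.2f} per canonical 4N"
    )
```

Even with complementary 4Ns merged, a 500 bp contig gives only a handful of observations per category, which is why the frequency profile of a short contig is so noisy.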

Separately, some contigs will be far enough in the reduced dimensionality space from the main clusters that they will not be binned. Unless the algorithm is using Gaussian Mixture Models or some such approach that will cluster everything, those wayward contigs will be set aside.
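The difference between the two behaviours can be shown with a toy example. This is not the algorithm of any particular binner, only a sketch under simple assumptions: contigs are hypothetical points in a 2D reduced space, cluster centres are given, and a contig is either assigned to its nearest centre no matter what, or left unassigned when it is farther than a chosen cutoff (`max_distance`, a made-up parameter).

```python
import math

def nearest_center(point, centers):
    """Return (index, distance) of the cluster centre closest to the point."""
    distances = [math.dist(point, center) for center in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]

def assign(points, centers, max_distance=None):
    """Label each point with its nearest centre, or None if beyond max_distance.

    With max_distance=None every point gets a label, mimicking methods
    that cluster everything.
    """
    labels = []
    for point in points:
        index, distance = nearest_center(point, centers)
        if max_distance is not None and distance > max_distance:
            labels.append(None)  # wayward contig: set aside as unbinned
        else:
            labels.append(index)
    return labels

# Two toy cluster centres and three toy contigs; the third sits between clusters.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.6, 10.3), (5.0, 4.0)]

with_cutoff = assign(points, centers, max_distance=2.0)
assign_all = assign(points, centers)
print("with cutoff:", with_cutoff)
print("assign all: ", assign_all)
```

The third point is placed in the first cluster when everything must be assigned, but is set aside once a distance cutoff is applied, which is the trade-off between completeness and purity described above.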
