Question: Binning Tools for Long Reads/Contigs
Asked 6 days ago by vijinim (90):

The majority of currently available metagenomic binning tools are designed to work with short reads and with contigs assembled from short reads.

Does anyone know of any tools that can bin long reads, or contigs assembled from long reads?

Thank you very much! :)

modified 3 days ago • written 6 days ago by vijinim (90)

What is the difference between binning long contiguous sequences assembled from short reads and binning long contiguous sequences obtained from long reads?

written 6 days ago by 5heikki (8.2k)

I believe there is no difference apart from the different error rates of short and long reads.

However, I tried to bin a simulated dataset of reads from 2 bacterial genomes (read lengths of 20-21 kb, 10% error rate), and the tool failed to identify two bins. It produced only one bin containing a few sequences, and most of the remaining sequences were left unbinned. The tool I used was MaxBin 2.2.4.

modified 5 days ago • written 5 days ago by vijinim (90)

And how different were the two genomes? No tool will successfully separate e.g. Escherichia coli O157:H7 Sakai and Escherichia coli O157:H7 EC4115.

written 4 days ago by 5heikki (8.2k)

I used Escherichia coli CFT073 and Staphylococcus aureus JP080. When I take short reads from these genomes and bin the resulting contigs, MaxBin produces 2 bins with good results.

When I ran MaxBin on long reads from the same 2 genomes, however, it gave only 1 bin.

written 3 days ago by vijinim (90)

Does MaxBin also use depth of coverage? That could be the reason, as you don't get that dimension with long reads.

written 3 days ago by 5heikki (8.2k)

In this approach, tetranucleotide frequencies and scaffold coverages are combined to organize metagenomic sequences into individual bins, which are predicted from initial identification of marker genes in assembled sequences.
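
For anyone unfamiliar with the composition signal mentioned in that quote, here is a minimal sketch of how tetranucleotide frequencies can be computed for one sequence. It is only an illustration, not MaxBin's actual code: the function name is made up, reverse complements are not merged, and windows containing ambiguous bases are simply skipped.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 4^4 = 256 possible tetramers, in a fixed order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each of the 256 tetramers in seq.

    Hypothetical helper for illustration; not part of MaxBin.
    """
    seq = seq.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made of A/C/G/T contribute to the total,
    # so windows containing N or other symbols are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return {t: 0.0 for t in TETRAMERS}
    return {t: counts[t] / total for t in TETRAMERS}
```

Each read or contig then becomes a 256-dimensional vector, and sequences from the same genome tend to sit close together in that space. Sequencing errors change the tetramers they fall in, so noisy reads give a blurrier vector.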


Despite careful selection of initialization conditions, the EM algorithm sometimes may still group scaffolds from several composite genomes into one bin. To alleviate this problem, all bins are recursively checked for the median number of marker genes. If the median number of marker genes of any bin is at least 2, the bin will be treated as a dataset waiting to be binned, and the whole EM algorithm will be applied to split the bin.
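
The recursive check described in that quote could be sketched roughly as below. This is my reading of the paragraph, not MaxBin's implementation: `markers_of` and `split` are hypothetical callbacks (the first returns the marker genes found on a contig, the second stands in for one run of the EM binning), and the median is taken over the copy counts of the marker genes present in the bin.

```python
from statistics import median


def marker_count_median(bin_contigs, markers_of):
    """Median copy number over the marker genes found in a bin."""
    counts = {}
    for contig in bin_contigs:
        for marker in markers_of(contig):
            counts[marker] = counts.get(marker, 0) + 1
    return median(counts.values()) if counts else 0


def recursive_split(bin_contigs, markers_of, split):
    """Keep splitting a bin while its median marker-gene count is >= 2."""
    if marker_count_median(bin_contigs, markers_of) < 2:
        return [bin_contigs]  # looks like a single genome; keep as is
    parts = [p for p in split(bin_contigs) if p]
    # Stop if the splitter made no progress, to avoid endless recursion.
    if len(parts) < 2 or any(len(p) >= len(bin_contigs) for p in parts):
        return [bin_contigs]
    result = []
    for part in parts:
        result.extend(recursive_split(part, markers_of, split))
    return result
```

Note that the whole check depends on marker genes being detected in the first place. If very few are found, the median stays below 2 and a mixed bin is never split.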

If MaxBin detects those marker genes at the protein level, I think your 10% simulated error rate is what leads to a single bin.
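
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes errors hit each base independently at a fixed rate, which is a simplification of real long-read error profiles, and the function names are made up for this example.

```python
def prob_error_free(length, error_rate):
    """Probability that a stretch of `length` bases contains no errors,
    assuming independent per-base errors at `error_rate`."""
    return (1.0 - error_rate) ** length


def expected_errors(length, error_rate):
    """Expected number of erroneous bases in a stretch of `length` bases."""
    return length * error_rate


# A single tetramer window at a 10% error rate:
print(prob_error_free(4, 0.10))
# A 1 kb gene at a 10% error rate:
print(expected_errors(1000, 0.10))
```

Under that assumption only about two thirds of tetramer windows are intact (0.9^4 ≈ 0.656), and a 1 kb gene carries around 100 errors on average, so a protein-level search on raw reads would have very little clean sequence to work with.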

modified 2 days ago • written 2 days ago by 5heikki (8.2k)

I think Kraken (and possibly Centrifuge) can take long reads. I'm fairly sure Kraken can work on contigs too.

written 3 days ago by jrj.healey (11k)