You're welcome, and I'm glad you're finding BBNorm useful!
Dedupe is designed to simplify assemblies by removing sequences that are exact duplicates, or cases where one sequence is completely contained within another, within some maximum Hamming distance or edit distance. It does not remove or merge sequences that merely overlap, with neither fully contained by the other (though I do want to add merging some day). The clustering is based on transitive overlap; if sequence A overlaps B, and B overlaps C, and C overlaps D, then A, B, C, and D will all be in the same cluster.
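To make the transitive-overlap idea concrete, here is a minimal Python sketch using union-find. It is only an illustration of the clustering rule described above, not Dedupe's actual code; the function name `cluster_by_overlap` and the input format (a list of names plus a list of overlapping pairs) are assumptions made for the example.

```python
def cluster_by_overlap(names, overlaps):
    """Group sequence names into clusters by transitive overlap (union-find)."""
    parent = {n: n for n in names}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical input: A-B, B-C and C-D overlap; E overlaps nothing.
names = ["A", "B", "C", "D", "E"]
overlaps = [("A", "B"), ("B", "C"), ("C", "D")]
print(cluster_by_overlap(names, overlaps))
# [['A', 'B', 'C', 'D'], ['E']]
```

A never directly overlaps D here, but they still end up together because of the chain through B and C.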
Typically a cluster (with sufficient depth) will contain lots of redundant overlaps, such that A, B, C, and D all overlap each other, for 6 total overlaps instead of just 3. Dedupe can break cycles in a cluster's overlap graph to generate a canonical traversal, and it can even do so by building an MST, which will ensure that the cycles are broken by removing the shortest overlaps, so the resulting tree only has the longest overlaps (MST mode takes longer, though, mainly because it is incompatible with the 'preventTransitiveOverlaps' flag, which makes dense graphs much faster). The resulting tree can be printed in dot (GraphViz) format, and could be analyzed to remove unnecessary sequences by another program, but Dedupe can't currently remove the sequences that are subsumed by overlap. In fact doing that optimally is probably NP-complete, but it would not be too difficult to do greedily.
I will put that on my list of things to consider adding, though.
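As a rough sketch of the cycle-breaking idea, the snippet below runs Kruskal's algorithm with the longest overlaps first, so a fully connected 4-sequence cluster with 6 overlaps is reduced to a 3-edge tree that keeps the longest ones. This is an assumption about how the "MST keeps the longest overlaps" behavior could be modeled (effectively a maximum spanning tree by overlap length), not Dedupe's implementation; `max_overlap_tree` and the overlap lengths are made up for the example.

```python
def max_overlap_tree(nodes, edges):
    """Kruskal's algorithm, longest overlaps first.

    edges is a list of (a, b, overlap_length) tuples.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for a, b, length in sorted(edges, key=lambda e: e[2], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:  # edge joins two components, so it cannot close a cycle
            parent[rb] = ra
            kept.append((a, b, length))
    return kept


# Hypothetical cluster where all four sequences overlap each other (6 overlaps).
nodes = ["A", "B", "C", "D"]
edges = [
    ("A", "B", 90), ("B", "C", 80), ("C", "D", 85),
    ("A", "C", 40), ("B", "D", 35), ("A", "D", 10),
]
tree = max_overlap_tree(nodes, edges)
print(tree)
# [('A', 'B', 90), ('C', 'D', 85), ('B', 'C', 80)]
```

The three shortest overlaps (A-C, B-D, A-D) are the ones dropped, which matches the description of breaking cycles by removing the shortest overlaps.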
The fixmultijoins flag is there to remove redundant overlaps that are sometimes created because Dedupe runs multithreaded, and two different threads can sometimes create the same overlap at the same time. It doesn't remove anything useful; it just means that when you plot the graph afterward, there won't be any nodes connected by duplicate edges.
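For intuition only, this is what collapsing duplicate edges looks like on a plain edge list. It assumes the overlap graph is treated as undirected, so A-B and B-A count as the same overlap; that assumption and the name `drop_duplicate_edges` are mine, and this is not how Dedupe does it internally.

```python
def drop_duplicate_edges(edges):
    """Keep the first occurrence of each undirected edge, preserving order."""
    seen = set()
    unique = []
    for a, b in edges:
        key = frozenset((a, b))  # same key for (a, b) and (b, a)
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique


# Hypothetical edge list in which two threads recorded the A-B overlap.
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("A", "B")]
print(drop_duplicate_edges(edges))
# [('A', 'B'), ('B', 'C')]
```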
"out" will get ALL reads that survived deduplication, regardless of whether they are clustered. However, I do have another program,
filterbyname.sh, that can remove all sequences from a file that either share or don't share names with sequences in another file:
filterbyname.sh in=a.fq names=b.fq out=c.fq include=f
That would yield c.fq, containing only the reads in a.fq whose names do not appear in b.fq. So if a.fq were all reads and b.fq were clustered reads, then c.fq would only contain unclustered reads. Names can also be a comma-delimited list of files.
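If it helps to see the name-filtering logic spelled out, here is a small in-memory Python sketch. It assumes that include=f means "drop reads whose names are in the list" and include=t means "keep only those"; the function `filter_by_name` and the toy records are hypothetical and are not part of filterbyname.sh.

```python
def filter_by_name(records, names, include=False):
    """Filter (name, sequence) records by membership in a set of names.

    include=True  keeps only records whose name is in names.
    include=False keeps only records whose name is NOT in names.
    """
    names = set(names)
    return [(n, s) for n, s in records if (n in names) == include]


# Hypothetical data: three reads in total, two of which were clustered.
all_reads = [("r1", "ACGT"), ("r2", "GGCA"), ("r3", "TTAG")]
clustered_names = ["r1", "r3"]

unclustered = filter_by_name(all_reads, clustered_names, include=False)
print(unclustered)
# [('r2', 'GGCA')]
```

With include=False, the input is the full read set and the name list comes from the clustered reads, so what is left over is the unclustered reads.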
A typical command for clustering like you are doing is something like this:
dedupe.sh in=x.fq am=t ac=t fo c rnc=f mcs=2 mo=100 cc pto=t csf=stats.txt out=all.fq pattern=cluster_%.fq qin=33 dot=graph.dot
If you want the output graph to contain only maximal overlaps trees, then change
Tagging Brian Bushnell
Thanks for notifying me! :)
You're welcome. Also, I guess you're following the bbtools tag - if not, that should notify you of any new posts with that tag.
Thanks, had no idea I could tag a specific user.
You're welcome! To tag a user, copy-paste their biostars/u/ URL and it becomes the username. The user is notified as well.
https://www.biostars.org/u/15973/ becomes Charles Pepe-Ranney