Question

dedupe.sh output from BBTools


9.5 years ago

Charles Pepe-Ranney ▴ 10

I'm wondering what the Dedupe contig overlap output from BBTools actually is. I can't quite figure it out from the documentation. Are clusters the reads that overlap, and does the regular outfile contain the rest of the reads with containments/duplicates removed? If I invoke the processclusters flag with fixmultijoins=t, will the clusters no longer contain the overlapping sequence?

...thanks so much for getting the tools out there. BBNorm is way faster than I anticipated! We would really like to use Dedupe such that the resulting file has only one instance of each overlapping sequence (in addition to removing containments and duplicates). Is this possible with Dedupe?

Cheers,
-Chuck

bbtools dedupe bbmap deduplicate-contigs • 6.3k views

updated 2.3 years ago by Ram 44k • written 9.5 years ago by Charles Pepe-Ranney ▴ 10


Tagging Brian Bushnell

2.3 years ago by Ram 44k


Thanks for notifying me! :)

updated 2.3 years ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k


You're welcome. Also, I guess you're following the bbtools tag - if not, that should notify you of any new posts with that tag.

2.3 years ago by Ram 44k


Thanks, had no idea I could tag a specific user.

updated 2.3 years ago by Ram 44k • written 9.5 years ago by Charles Pepe-Ranney ▴ 10


You're welcome! To tag a user, copy-paste their biostars/u/ URL and it becomes the username. The user is notified as well.

Example: https://www.biostars.org/u/15973/ becomes Charles Pepe-Ranney

2.3 years ago by Ram 44k

Ram · Accepted Answer · 2015-01-27

Chuck,

You're welcome, and I'm glad you're finding BBNorm useful!

Dedupe is designed to simplify assemblies by removing sequences that are exact duplicates of another sequence, or that are completely contained within another sequence, within some maximum Hamming distance or edit distance. It does not remove or merge sequences that merely overlap, with neither fully contained by the other (though I do want to add merging some day). The clustering is based on transitive overlap: if sequence A overlaps B, and B overlaps C, and C overlaps D, then A, B, C, and D will all be in the same cluster.
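To make the transitive clustering concrete, here is a minimal Python sketch using union-find. It is only an illustration of the idea, not Dedupe's actual code; the function name and the toy overlap list are made up for the example.

```python
from collections import defaultdict

def cluster_by_overlap(names, overlaps):
    """Group sequences so that any two linked by a chain of overlaps share a cluster.

    names: list of sequence names.
    overlaps: list of (name_a, name_b) pairs (hypothetical input for this sketch).
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())

# A overlaps B, B overlaps C, C overlaps D; E overlaps nothing.
clusters = cluster_by_overlap(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D")],
)
print(clusters)  # [['A', 'B', 'C', 'D'], ['E']]
```

A and D never overlap directly here, but they still land in the same cluster because of the chain through B and C.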

Typically a cluster (with sufficient depth) will contain lots of redundant overlaps, such that A, B, C, and D all overlap each other, for 6 total overlaps instead of just the 3 needed to connect them. Dedupe can break cycles in a cluster's overlap graph to generate a canonical traversal, and it can even do so by building an MST, which ensures that the cycles are broken by removing the shortest overlaps, so the resulting tree has only the longest overlaps. MST mode takes longer, though, mainly because it is incompatible with the 'preventTransitiveOverlaps' flag, which makes dense graphs much faster. The resulting tree can be printed in dot (GraphViz) format and could be analyzed by another program to remove unnecessary sequences, but Dedupe can't currently remove the sequences that are subsumed by overlap. In fact, doing that optimally is probably NP-complete, but it would not be too difficult to do greedily.
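As a rough picture of the MST idea, the sketch below runs Kruskal's algorithm with the edges sorted longest-first, so cycles get broken by dropping the shortest overlaps. This is a generic textbook approach rather than Dedupe's implementation, and the overlap lengths are invented for the example.

```python
def max_spanning_tree(nodes, edges):
    """Return the edges of a spanning tree that keeps the longest overlaps.

    edges: list of (overlap_length, node_a, node_b) tuples (hypothetical data).
    """
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Visit the longest overlaps first; skip any edge that would close a cycle.
    for length, a, b in sorted(edges, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            tree.append((length, a, b))
    return tree

# Four sequences that all overlap each other: 6 overlaps in total.
edges = [
    (150, "A", "B"), (120, "B", "C"), (140, "C", "D"),
    (60, "A", "C"), (50, "B", "D"), (30, "A", "D"),
]
tree = max_spanning_tree(["A", "B", "C", "D"], edges)
for length, a, b in tree:
    print(a, b, length)
# A B 150
# C D 140
# B C 120
```

The 6 overlaps shrink to 3, and the ones that are dropped are the three shortest.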

I will put that on my list of things to consider adding, though.

The fixmultijoins flag is there to remove redundant overlaps that are sometimes created because Dedupe runs multithreaded, and two different threads can sometimes create the same overlap at the same time. It doesn't remove anything useful; it just means that when you plot the graph afterward, there won't be any nodes connected by duplicate edges.
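If it helps to picture what "duplicate edges" means here, this toy Python sketch collapses repeated edges between the same pair of nodes. It is not how Dedupe does it internally; treating (A, B) and (B, A) as the same overlap is an assumption of the sketch.

```python
def drop_duplicate_edges(edges):
    """Keep the first occurrence of each edge between a given pair of nodes."""
    seen = set()
    unique = []
    for a, b in edges:
        # Assumption: direction does not matter, so normalize the pair order.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique

edges = [("A", "B"), ("B", "A"), ("B", "C"), ("A", "B")]
print(drop_duplicate_edges(edges))  # [('A', 'B'), ('B', 'C')]
```

No node loses its connection to any other node; only the repeated edges go away.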

"out" will get ALL reads that survived deduplication, regardless of whether they are clustered. However, I do have another program, filterbyname.sh, that can remove all sequences from a file that either share or don't share names with sequences in another file:

filterbyname.sh in=a.fq names=b.fq out=c.fq include=f

That would yield c.fq, containing every read in a.fq whose name does not appear in b.fq (with include=t it would instead keep only the reads whose names do appear in b.fq). So if a.fq were all reads and b.fq were the clustered reads, then c.fq would contain only unclustered reads. Names can also be a comma-delimited list of files.
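The include flag's behavior can be sketched in a few lines of Python. This only mimics the name-matching logic on in-memory (name, sequence) pairs; it is not filterbyname.sh itself, and the read names are made up.

```python
def filter_by_name(records, names, include=False):
    """Keep records whose names are in `names` (include=True) or are not (include=False).

    records: list of (name, sequence) tuples (hypothetical stand-in for a FASTQ file).
    names: iterable of read names (stand-in for the names in the other file).
    """
    name_set = set(names)
    return [(name, seq) for name, seq in records if (name in name_set) == include]

all_reads = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "TTAA")]
clustered_names = ["r1", "r3"]

# include=False: drop every read that shares a name with the clustered set.
print(filter_by_name(all_reads, clustered_names))
# [('r2', 'GGCC')]

# include=True: keep only the reads that share a name with the clustered set.
print(filter_by_name(all_reads, clustered_names, include=True))
# [('r1', 'ACGT'), ('r3', 'TTAA')]
```

With all reads as the input and the clustered names as the filter, include=False leaves just the unclustered reads.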

A typical command for clustering like you are doing is something like this:

dedupe.sh in=x.fq am=t ac=t fo c rnc=f mcs=2 mo=100 cc pto=t csf=stats.txt out=all.fq pattern=cluster_%.fq qin=33 dot=graph.dot

If you want the output graph to contain only maximum-overlap trees, then change pto=t to pto=f mst=t.