Tool: BBSketch - A Tool for Rapid Sequence Comparison
gravatar for Brian Bushnell
2.3 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

MinHash Sketch is a method of rapidly comparing large strings or sets. In genomics, you can use it like this:

1) Gather all the kmers in a genome.

2) Apply a hash function to them.

3) Keep the 10000 smallest hashcodes and call this set a "sketch".

If you do this for multiple genomes, you can calculate how similar two genomes are much faster than via alignment; and specifically, the greatest advantage is that the speed of sketch comparison is unrelated to genome size. For example, if you assemble something, you can sketch it and compare it to sketches of everything in nt or RefSeq in a couple seconds. If the top hit shows that your new sketch shared 90% of its hash codes with E.coli, that means it has roughly 90% kmer identity to E.coli, and therefore, it is E.coli.

The BBMap package has an extremely efficient MinHash Sketch implementation, now accessible through 4 programs. The latest, just released in BBMap 36.92, is SendSketch: in=contigs.fa

This will make a sketch of your assembly, open an HTTP connection to JGI's taxonomy server, and send the sketch. The server will compare it to sketches of all of nt, and return the top hits. This is kind of convenient because the sketches sit around in memory all the time rather than needing to be loaded, so you get the answer in a couple seconds. You can also add the flag "refseq" to compare to RefSeq instead of nt. Additionally, SendSketch works on raw reads (including Illumina, PacBio, and Nanopore). With Illumina the files can be huge and you don't need all of the reads so I normally run it like this: in=reads.fq reads=400k limit it to the first 400,000 reads. Raw PacBio has a high error rate so I typically increase the default sketch size to increase sensitivity: in=PacBio.fq size=100000

You can also run a local server with "" if you want to use a different database for SendSketch.

The other tools are for processing sketches locally. To sketch and query refseq, for example: in=refseq.fa.gz out=refseq#.sketch files=31 mode=sequence

That will produce one sketch per sequence, and split them among 31 output files (for rapid loading). Alternatively, and preferably, you can produce one sketch per taxID. This process is a bit more involved, but the full details are given in the BBMap package in the /bbmap/pipelines/ directory, in these example shellscripts:

Now that you have created some sketches, you can query them like this: in=contigs.fa refseq*.sketch

"in=" can be a comma-delimited list of either fasta or sketch files. Also, you can turn an input fasta into one sketch per sequence rather than just a single sketch with the "mode=sequence" flag. If you want to process with CompareSketch using taxonomic information, add "tree=taxtree.gz" as well (which was created by the "" script). SendSketch automatically uses taxonomy on the server-side.

So, that's how you run Sketch. Now, what does the output mean? Here's an example, using E.coli:

The complete description of columns is in /bbmap/docs/guides/BBSketchGuide.txt, but briefly, "matches" is the number of matching kmers between the query and reference sketch. KID (or kmer identity) is simply the percent of the kmers that matched, while WKID is the weighted kmer identity, which has been adjusted to compensate for genome size. The ANI, completeness, and contamination are estimates, comparing the query to that reference.

Titus Brown suggested that I cite prior art, since this is not a novel idea, so:

The MinHash concept, which is about 20 years old:



blast tool bbmap sketch minhash • 3.5k views
ADD COMMENTlink modified 22 months ago • written 2.3 years ago by Brian Bushnell16k

Nice. I think this is how miniasm does it's alignments also?

ADD REPLYlink written 2.3 years ago by Damian Kao15k

Oh, I'm not sure; I'll have to examine miniasm in more detail. Just skimming, it looks like, "maybe" or "something similar". But Mash and SourMash certainly do use this principle.

ADD REPLYlink written 2.3 years ago by Brian Bushnell16k

Canu runs MinHash too for the first alignment of noisy reads, see Fig. 1 in 'Assembling large genomes with single-molecule sequencing and locality-sensitive hashing'

There's also kWIP, which works differently, but arrives at similar results and can be used for the same things:

ADD REPLYlink written 2.3 years ago by Philipp Bayer6.0k

This is great. Thanks! I see documentation for building sketches by sequence or taxon -- do you have a recipe for building a reference database where the sketches each represent an assembly/genome?

ADD REPLYlink modified 19 months ago • written 19 months ago by Charles Pepe-Ranney0

Currently I use taxa as a proxy for assembly or genome. There is no "make one sketch per file" mode, which would be quite useful and I keep meaning to add it. But there is "one sketch" mode, so if each assembly is in a single file, you can loop over them and run one BBSketch process per file to generate one sketch per file. Then concatenate the sketches. Sketches are just text files and a .sketch file can contain one or more sketches, so concatenating them is fine.

ADD REPLYlink written 19 months ago by Brian Bushnell16k

What is the default kmer size for BBSketch?

ADD REPLYlink written 13 months ago by san620
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 860 users visited in the last hour