Minia is a low-memory short-read assembler for large genomes. It creates contigs.
DSK is a low-memory k-mer counter.
We have ported Minia and DSK to a new codebase that uses the GATB library. To make the change clear, from now on, Minia and DSK using the new codebase will have versions 2.x.x.
Changes in Minia 2.x.x:
- Faster (multi-core parallelism)
- Slightly more accurate (has coverage information in the graph, for better discrimination between sequencing errors and polymorphism)
- Less disk usage (because of DSK)
- Can output unitigs
Changes in DSK 2.x.x:
- Faster (multi-core parallelism)
- Less disk usage
- Comparable performance to KMC2 (we're using their techniques :))
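For readers unfamiliar with what a k-mer counter computes, here is a minimal in-memory sketch in Python. It is only an illustration of the task, not how DSK works internally (DSK is designed to use little memory, which this toy version is not). The function names `canonical` and `count_kmers` are made up for this example, and merging a k-mer with its reverse complement is a common convention that is assumed here.

```python
from collections import Counter

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over an iterable of DNA strings (toy, in-memory)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts
```

Calling `count_kmers(reads, 31)` on a list of read strings returns a `Counter` mapping each canonical 31-mer to its abundance. Low-abundance k-mers in such a table are typically the ones suspected of being sequencing errors.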
Download (Linux 64 bits):
For legacy purposes, the final versions of Minia and DSK 1.xxx (old codebase) are available at http://minia.genouest.org/files/minia-1.6906.tar.gz and http://minia.genouest.org/dsk/dsk-1.6906.tar.gz.
However, we recommend using the 2.x.x versions: results are expected to be identical (in the case of DSK) or slightly better (Minia), while performance of the 2.x.x versions is significantly better (2x-4x) than that of the 1.xxx versions.
You might be tempted to reply to this post if you find a bug, an installation problem, etc. Please make a new Biostar post instead:
Nice. I am trying it out now on some reads I assembled last night with Abyss to compare.
BTW, on my Ubuntu distro (12.04), I had to:
To get precompiled minia to run.
Thanks. Going to fix that shortly (DSK fixed already -- Minia compatible binaries coming). EDIT: done
DSK binary not working on centos5, also due to libstdc++.
Oh.. OK, let's see, I have re-created the 2.0.1 binaries (minia+dsk) using static linking (-static flag) and static linking of libstdc++ (-static-libstdc++ flag). It gave me a warning ("Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking") but the binary seems to work on several different machines.
I don't have any centos5 machine but out of curiosity I tested using Docker:
(inside the docker image:)
and it didn't complain about glibc or libstdc++.
Just for clarity (if anyone is confused by these command lines), there is no need to go through all of this to run DSK: the binary should work on linux 64 bits right away. This was just to illustrate how to test a program on Centos5.
"FATAL: kernel too old" on a centos5 VirtualBox. Docker probably won't solve kernel problems. I have compiled a version here on centos5. Most Broad machines are centos5, so I care. I run a clean centos5 VirtualBox just for compiling.
Thanks, good to know that Docker isn't sufficient for kernel compatibility.
I've compiled a new release (that includes minor bugfixes), DSK/Minia 2.0.2, using a centos5 virtualbox.
What I like about KMC2 is that it provides relatively standalone lightweight APIs to access the k-mer count files. I can embed several C++ files directly into my source code and forget about extra dependencies. I assume that to read DSK counts, I have to use the entire GATB?
That's a good point.. the answer is "yes" as of today.
The output of DSK is in HDF5 format. As @edrezen just told me, even if we remove the GATB dependency for parsing DSK results, you'd still need an HDF5 parser. At this point, since the HDF5 library is quite big, one might as well include the whole GATB.
If a developer is serious about parsing DSK results inside his software, please get in touch with us, I'm sure we can work something out (such as making DSK return an easy-to-parse, non-HDF5 output format). However I'm missing a clear picture of an actual use case: if a developer has to parse DSK output (or KMC for that matter), is he packaging the source, or a binary, of DSK (resp. KMC) along?
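To make the "easy-to-parse, non-HDF5 output" idea concrete, here is a minimal Python sketch of what reading such a file could look like. The format is purely hypothetical (one `<kmer> <count>` pair per line, whitespace-separated); it is not an existing DSK output format, and `read_kmer_counts` is a name invented for this example.

```python
def read_kmer_counts(lines):
    """Yield (kmer, count) pairs from lines of the hypothetical form '<kmer> <count>'."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected '<kmer> <count>', got {line!r}"
            )
        kmer, count = fields
        yield kmer, int(count)


# Usage with in-memory lines; an open text file object works the same way.
example = ["ACGTA 12", "TTGCA 3"]
abundant = {kmer: n for kmer, n in read_kmer_counts(example) if n >= 5}
```

The point of such a format is that a downstream tool needs nothing beyond its language's standard library to read the counts, at the cost of larger files and slower parsing than a binary format.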
I use KMC2 for toy projects. I ask users to download and run the official KMC2 by themselves. I don't package the KMC2 binary. I only use several of its files to read KMC2 k-mer counts. Bless, an error corrector, uses KMC2, too. It packages all the KMC2 source code as it has modified KMC2 to support MPI. Bless calls its own version of KMC2. It does not work with the official KMC2. A lightweight API to access k-mer counts is of course not essential, but having one would encourage other developers to use DSK.
Oh I see.. also your error correction tool BFC (the KMC2 branch) provides a concrete example.
Didn't know about Bless' KMC2 modification, nice! For anyone interested (probably Guillaume will be), here is the diff between the kmer_counter folders of original KMC2 and Bless':