Velvet's algorithms in theory work for genomes of any size. In practice, however, Velvet's engineering constraints, in particular its memory consumption, mean it cannot handle read sets beyond a certain size. Where that limit falls depends, of course, on how much physical memory your machine has.
I know we have "routinely" (i.e., for multiple strains) assembled Drosophila-sized genomes (~120MB) on a 125GB machine.
I've heard of Velvet being used in the 200-300MB range, but rarely beyond. Memory consumption is driven not just by the size of the genome but also by how error-prone your reads are (though sheer size is important), since sequencing errors introduce spurious k-mers that inflate the graph.
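To make the error-rate point concrete, here is a toy Python sketch (not Velvet's actual data structures; the genome size, coverage, read length and error rate are just illustrative assumptions). A de Bruijn-style assembler has to hold roughly one node per distinct k-mer, and every substitution error can create up to k k-mers that do not exist in the genome:

```python
# Toy illustration only (not Velvet's data structures): count how many
# distinct k-mers a de Bruijn graph would have to store, with and without
# substitution errors in the reads.

import random

K = 31
BASES = "ACGT"

def kmers(seq, k=K):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def simulate(genome_size=100_000, coverage=30, read_len=100, error_rate=0.01):
    """Return (#distinct k-mers from clean reads, #distinct from noisy reads)."""
    genome = "".join(random.choice(BASES) for _ in range(genome_size))
    n_reads = genome_size * coverage // read_len

    clean, noisy = set(), set()
    for _ in range(n_reads):
        start = random.randrange(genome_size - read_len)
        read = genome[start:start + read_len]
        clean.update(kmers(read))
        # Apply random substitutions at the given per-base error rate.
        erroneous = "".join(
            random.choice(BASES) if random.random() < error_rate else base
            for base in read
        )
        noisy.update(kmers(erroneous))
    return len(clean), len(noisy)

if __name__ == "__main__":
    n_clean, n_noisy = simulate()
    print(f"distinct k-mers, error-free reads: {n_clean:,}")
    print(f"distinct k-mers, ~1% error reads:  {n_noisy:,}")
```

With roughly 1% errors on 100bp reads, the noisy k-mer set usually comes out several times larger than the error-free one, which is why error rate matters for memory almost as much as genome size does.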
Beyond this, there are a variety of strategies:
"Raw" de Bruijn graphs, without tremendously aggressive use of read pairs, can be built using Cortex (unpublished, from Mario Caccamo and Zam Iqbal) or ABySS (published, well understood, from the BC Genome Sciences Centre).
Curtain (unpublished, but available, from Matthias Haimel at EBI) can do a smart partition of the reads given an initial de Bruijn graph, run Velvet on the partitions, and thus provide an improved, more read-pair-aware graph. This can be iterated, and in at least some cases the Curtain approach gets close to what Velvet can produce alone (in the scenarios where Velvet can be run on a single large-memory machine, so that Curtain's performance can be compared against it).
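To be clear, the following is only a minimal sketch of the general partition-then-assemble idea, not Curtain's actual algorithm or interface. The inputs (seed_contigs.fa, reads.fa), the binning rule (assign each read to the seed contig most of its k-mers hit) and the run_velvet wrapper around the standard velveth/velvetg commands are all assumptions for illustration:

```python
# Rough sketch of the partition-then-assemble idea (NOT Curtain's real
# algorithm or interface). Reads are binned by the seed contig their k-mers
# mostly map to, then each bin is assembled independently.

import os
import subprocess
from collections import Counter, defaultdict

K = 31

def read_fasta(path):
    """Minimal FASTA reader yielding (header, sequence) pairs."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def build_kmer_index(seed_contigs_fasta):
    """Map each k-mer of the seed contigs to the contig it came from."""
    index = {}
    for name, seq in read_fasta(seed_contigs_fasta):
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]] = name
    return index

def assign_read(seq, index):
    """Vote: the seed contig hit by most of the read's k-mers, or None."""
    votes = Counter()
    for i in range(len(seq) - K + 1):
        hit = index.get(seq[i:i + K])
        if hit is not None:
            votes[hit] += 1
    return votes.most_common(1)[0][0] if votes else None

def partition_reads(reads_fasta, index, out_dir):
    """Write one FASTA file per partition; return the list of files."""
    os.makedirs(out_dir, exist_ok=True)
    bins = defaultdict(list)
    for name, seq in read_fasta(reads_fasta):
        bins[assign_read(seq, index) or "unassigned"].append((name, seq))
    files = []
    for label, reads in bins.items():
        path = os.path.join(out_dir, f"partition_{label}.fa")
        with open(path, "w") as out:
            for name, seq in reads:
                out.write(f">{name}\n{seq}\n")
        files.append(path)
    return files

def run_velvet(partition_fasta, work_dir):
    """Hypothetical wrapper: shell out to velveth/velvetg on one partition."""
    subprocess.run(["velveth", work_dir, str(K), "-fasta", "-short",
                    partition_fasta], check=True)
    subprocess.run(["velvetg", work_dir], check=True)
    return os.path.join(work_dir, "contigs.fa")

if __name__ == "__main__":
    # seed_contigs.fa and reads.fa are placeholder inputs for illustration.
    index = build_kmer_index("seed_contigs.fa")
    for part in partition_reads("reads.fa", index, "partitions"):
        run_velvet(part, part + "_velvet")
```

A real tool would also keep read pairs in the same partition and merge or iterate on the per-partition assemblies, which this sketch leaves out.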
SOAPdenovo from the BGI is responsible for a number of the published assemblies (e.g., Panda, YH), although, like many assemblers, tuning it seems quite hard, and I would definitely ask the BGI guys for advice.
A new version of ALLPATHS (from the Broad crew) looks extremely interesting, but
is not quite released yet.
In all the above cases I know of successes, but also quite a few failures, and untangling data quality/algorithm/choice of parameters/running bugs is really complex. So, whereas assemblies under 100MB are "routine", assemblies of 100MB-500MB are currently "challenging", and assemblies over 500MB are theoretically doable, and have been done by specific groups, but I think they are still at the leading edge of development, and one should not be confident of success for "any particular genome".
No, I didn't see it before, thanks for reposting it. I've read the Panda paper, and it looks like they had to spend a lot of effort to get decent results: the initial assembly, based on 39x regular PE coverage, had a meagre 1.5K N50. My efforts with SOAPdenovo aren't very positive so far, but it does use a de Bruijn approach, so perhaps it will fare better if I filter the data heavily? (My opinion is that de Bruijn is very difficult to get right, especially with noisy data.)
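For what it's worth, one common way to "filter heavily" before a de Bruijn assembler is to drop (or trim) reads containing low-abundance k-mers, since k-mers seen only once or twice across a whole data set are overwhelmingly sequencing errors. The sketch below is generic, not SOAPdenovo's own error-correction step, and the k-mer size, threshold and file names are just assumptions:

```python
# Minimal sketch of k-mer-frequency read filtering for noisy data (not
# SOAPdenovo's error-correction module): discard reads containing any k-mer
# seen fewer than MIN_COUNT times in the whole data set.

from collections import Counter

K = 21          # assumed k-mer size
MIN_COUNT = 3   # assumed abundance threshold; tune to coverage

def read_fastq(path):
    """Minimal FASTQ reader yielding (header, sequence, quality) records."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()                 # '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def kmers(seq, k=K):
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def filter_reads(in_fastq, out_fastq):
    # Pass 1: count k-mer abundances across the whole data set.
    counts = Counter()
    for _, seq, _ in read_fastq(in_fastq):
        counts.update(kmers(seq))

    # Pass 2: keep only reads whose every k-mer is solidly supported.
    kept = dropped = 0
    with open(out_fastq, "w") as out:
        for header, seq, qual in read_fastq(in_fastq):
            if all(counts[km] >= MIN_COUNT for km in kmers(seq)):
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} reads, dropped {dropped}")

if __name__ == "__main__":
    # File names are placeholders for illustration.
    filter_reads("reads.fastq", "reads.filtered.fastq")
```

In practice the counting would be done with a dedicated k-mer counter rather than an in-memory dictionary (the Counter above will not fit a large data set in RAM), but the principle is the same.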
Although I appreciate the answer, most of this is exactly what I did not want: general/vague success stories from the developers :-)
Ketil, I realize that, but from what I can tell you are in mostly uncharted territory, something reflected in Ewan's comments. Short of hunting down people struggling with the same problem, you'll probably have a hard time finding success stories. The SeqAnswers forum might be a better place to find researchers in a similar situation.
Just a small update: I finally got around to trying Velvet, but it got killed after running out of memory. This was on only a fraction of the data, and sadly, I only have 144GB of RAM available. So Velvet is out, I think.