I had a look at the log you sent me (parsed with the grep command from https://github.com/BioInf-Wuerzburg/proovread#log-and-statistics):
[Mon Nov 30 17:38:30 2015] Running mode: sr
[Mon Nov 30 17:56:08 2015] Running task bwa-sr-1
[Wed Dec 2 17:20:54 2015] Masked : 81.1%
[Wed Dec 2 17:20:54 2015] Running task bwa-sr-2
[Fri Dec 4 00:23:09 2015] Masked : 89.1%
[Fri Dec 4 00:23:09 2015] Running task bwa-sr-3
[Sat Dec 5 12:03:45 2015] Masked : 91.2%
[Sat Dec 5 12:03:46 2015] Running task bwa-sr-finish
[Sun Dec 6 05:54:53 2015] Masked : 89.2%
Proovread ran 3 correction iterations (which successively improve read quality, with the high-quality corrected parts being "masked"), followed by the finish correction, which is mostly for polishing.
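If you want to pull these numbers out of a log programmatically, here is a minimal sketch. It assumes the log lines look exactly like the excerpt above; the function name `masked_per_task` is just something I made up for illustration, it is not part of proovread.

```python
import re

# Log excerpt copied from above.
LOG = """\
[Mon Nov 30 17:38:30 2015] Running mode: sr
[Mon Nov 30 17:56:08 2015] Running task bwa-sr-1
[Wed Dec 2 17:20:54 2015] Masked : 81.1%
[Wed Dec 2 17:20:54 2015] Running task bwa-sr-2
[Fri Dec 4 00:23:09 2015] Masked : 89.1%
[Fri Dec 4 00:23:09 2015] Running task bwa-sr-3
[Sat Dec 5 12:03:45 2015] Masked : 91.2%
[Sat Dec 5 12:03:46 2015] Running task bwa-sr-finish
[Sun Dec 6 05:54:53 2015] Masked : 89.2%
"""


def masked_per_task(log_text):
    """Return (task, masked_percent) pairs in the order they appear.

    Assumption: each "Masked" line reports the result of the most
    recently started task, as in the excerpt above.
    """
    results = []
    task = None
    for line in log_text.splitlines():
        started = re.search(r"Running task (\S+)", line)
        if started:
            task = started.group(1)
            continue
        masked = re.search(r"Masked\s*:\s*([\d.]+)%", line)
        if masked and task is not None:
            results.append((task, float(masked.group(1))))
    return results


for task, pct in masked_per_task(LOG):
    print(f"{task}: {pct}% masked")
```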
Your stats look pretty good. Getting up to 81.1% in the first iteration is great: it means more than 81% of your data is corrected right away. The second iteration gets you to 89.1%, the third only to 91.2%. By default, proovread stops iterating and starts the polish step once either 92% is masked or an iteration gains less than 3% over the previous one. The 92% is quite ambitious, in particular for large genome projects. In your case, the third iteration gains only about 2% but still takes 12h to run.
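To make the stopping rule concrete, here is a rough sketch of the logic as described above, applied to your masked values. This is not proovread's actual code: the function name, the `min_gain` parameter, and the exact comparisons (`>=` and `<`) are my assumptions for illustration.

```python
def iterations_until_finish(masked, mask_shortcut_frac=0.92, min_gain=3.0):
    """Count the iterations run before the finish/polish step starts.

    masked: masked percentage after each iteration, in order.
    Iterating stops once the masked fraction reaches the cutoff, or once
    an iteration gains less than `min_gain` percentage points over the
    previous one (assumed semantics, see the note above).
    """
    previous = 0.0
    for i, pct in enumerate(masked, start=1):
        reached_cutoff = pct >= mask_shortcut_frac * 100
        small_gain = i > 1 and (pct - previous) < min_gain
        if reached_cutoff or small_gain:
            return i
        previous = pct
    return len(masked)


masked = [81.1, 89.1, 91.2]  # values from the three iterations in the log

print(iterations_until_finish(masked))                           # default 92% cutoff
print(iterations_until_finish(masked, mask_shortcut_frac=0.85))  # lowered cutoff
```

With the default cutoff, all three iterations run (the third stops on the small gain). With 0.85, the second iteration already passes the cutoff, so the third is skipped.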
Therefore, my suggestion for your setup would be:
Don't aim for 92%, but rather something like 85%. This will save a lot of time (only two iterations) without losing a noteworthy portion of your data. You can set this via a custom config. Put the following line in a file (my-proovread.cfg) and pass it with -c:
'mask-shortcut-frac' => 0.85
proovread -c /path/to/my-proovread.cfg -l .. -s ..
If you compare your results to runs with lower Illumina coverage (30x or 40x), you should aim for the same thing: >85% after the second iteration. If you can get that from 30x or 40x, then using the lower coverage would reduce runtime even further; if not, stick with 50x.
As for chunk size, I would try to use the maximum possible for your queue with respect to the runtime limit and memory. Your queue has 24-core nodes with sufficient memory and a 144h per-job limit. Given a runtime of 24h for your test runs, you should be able to increase the chunk size by at least a factor of 6 (144h/24h). On top of that, larger chunks will run a bit faster anyway, and you will save time with the lowered mask-shortcut-frac cutoff. So my guess is that you should be able to run at least 500MB chunks on your normal_q (or even entire SMRT cells).
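The back-of-the-envelope calculation for the chunk size is just the ratio of the job limit to the observed runtime. The sketch below assumes runtime scales roughly linearly with chunk size, which should be a conservative assumption given that larger chunks tend to run a bit faster; the function name is made up for illustration.

```python
def chunk_scale_factor(job_limit_h, observed_runtime_h):
    """Factor by which the chunk size could grow and still fit the job limit.

    Assumes runtime grows roughly linearly with chunk size.
    """
    if observed_runtime_h <= 0:
        raise ValueError("observed runtime must be positive")
    return job_limit_h / observed_runtime_h


# 144h per-job limit on the queue, 24h for the test runs
print(chunk_scale_factor(144, 24))
```

Multiply your current chunk size by that factor to get a first estimate, then check it against the memory available per node.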
Let me know how it goes.
thackl • 2.8k