Question: PacBio sequences, proovread correction with Illumina HiSeq reads
jaimejr18 (Spain) wrote, 4.7 years ago:

Hi

I want to correct PacBio sequences with Illumina HiSeq data using the proovread tool. I can use a 2x60 core machine with 3 TB of memory or another one with 16x24 cores and 512 GB.

The developers advise splitting the SMRT cell data into small chunks and correcting each chunk separately. I did that and now have 475 files of 50 MB each. I ran proovread on one of the files with 50x coverage and it took 12 hours on the large-memory machine. But to correct all the files I would have to run them one by one, and that is going to take a very long time.

Does anyone use this tool? Could I process more than one PacBio file per run to reduce the total time?

Thanks 

Written 4.7 years ago by jaimejr18, modified 4.6 years ago

I don't know about the tool you're using, but if you have those resources available, why not do the samples in parallel? What was the resource usage like when you tested it on one sample?

Reply written 4.7 years ago by andrew.j.skelton73

I'm not sure, which is why I'm asking whether anyone has used this tool. I'm new to this kind of tool and not sure how to run it.

My access to the machines is restricted by time and core limits: if I use 60 cores, I can only use them for 12 hours, and I can't run another job until that one finishes, so 12 hours x 475 = 237 days...

Reply written 4.7 years ago by jaimejr18 (modified 9 months ago by RamRS)

Crossposted to SeqAnswers http://seqanswers.com/forums/showthread.php?t=65379

Reply written 4.7 years ago by Daniel Swan

Yes, it is the same question; I posted both. Should I remove one of them?

Reply written 4.7 years ago by jaimejr18 (modified 4.7 years ago)

thackl (MIT) wrote, 4.7 years ago:

Hi Jaime,

I had a look at the log you sent me (parsed with the grep command from https://github.com/BioInf-Wuerzburg/proovread#log-and-statistics):

[Mon Nov 30 17:38:30 2015] Running mode: sr
[Mon Nov 30 17:56:08 2015] Running task bwa-sr-1
[Wed Dec  2 17:20:54 2015] Masked : 81.1%
[Wed Dec  2 17:20:54 2015] Running task bwa-sr-2
[Fri Dec  4 00:23:09 2015] Masked : 89.1%
[Fri Dec  4 00:23:09 2015] Running task bwa-sr-3
[Sat Dec  5 12:03:45 2015] Masked : 91.2%
[Sat Dec  5 12:03:46 2015] Running task bwa-sr-finish
[Sun Dec  6 05:54:53 2015] Masked : 89.2%

proovread ran 3 correction iterations (which successively improve read quality, with high-quality corrected parts being "Masked"), followed by the finish correction, which is mostly for polishing.

Your stats look pretty good. Getting up to 81.1% in the first iteration is great. It means that more than 81% of your data gets corrected right away. The second iteration gets you to 89.1%, the third only to 91.2%. The default cutoff for proovread to stop iterating and start the polish step is either 92% masked or less than 3% gained compared to the previous iteration. The 92% is quite ambitious, in particular for large genome projects. In your case, the 3rd iteration also does not really get you anything but still takes 12h to run.
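The stopping rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not proovread's actual code: the function name and arguments are made up here, the gain is assumed to be measured in percentage points, and the first iteration's gain is assumed to be counted from 0%.

```python
def iterations_until_finish(masked_after_each, mask_cutoff=92.0, min_gain=3.0):
    """Return how many correction iterations run before the finish step.

    masked_after_each: masked percentage reported after each iteration.
    Iterating stops once the masked percentage reaches mask_cutoff, or once
    the gain over the previous iteration drops below min_gain
    (assumed to be percentage points).
    """
    previous = 0.0  # assumption: nothing is masked before the first iteration
    for count, masked in enumerate(masked_after_each, start=1):
        gain = masked - previous
        if masked >= mask_cutoff or gain < min_gain:
            return count
        previous = masked
    # Neither condition was met within the values supplied.
    return len(masked_after_each)


log_values = [81.1, 89.1, 91.2]  # "Masked" after bwa-sr-1, -2, -3 in the log

print(iterations_until_finish(log_values))                    # 3 (default cutoff)
print(iterations_until_finish(log_values, mask_cutoff=85.0))  # 2 (lower cutoff)
```

With the default cutoff, the third iteration gains only about 2 points and ends the loop. With a cutoff of 85%, the same numbers would already stop after the second iteration.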

Therefore, my suggestion for your setup would be:

Don't aim for 92%, but rather something like 85%. This will save a lot of time (only two iterations) without losing a noteworthy portion of your data. You can set this via a custom config. Put the following line in a file (my-proovread.cfg):

'mask-shortcut-frac' => 0.85

and call

proovread -c /path/to/my-proovread.cfg -l .. -s ..

If you compare your results to runs with lower Illumina coverage (30x or 40x), you should aim for the same thing: >85% after the second iteration. If you can get that from 30x or 40x, then using the lower coverage would reduce runtime even further; if not, stick with 50x.

As for chunk size, I would aim for the largest chunk your queue allows with respect to runtime limit and memory. Your queue has 24-core nodes with sufficient memory and a 144h per job limit. Given a runtime of 24h for your test runs, you should be able to increase chunk size by at least a factor of 6 (144h/24h). On top of that, larger chunks will run a bit faster anyway, and you will save time with the lowered mask-shortcut-frac cutoff. So my guess is that you should be able to run at least 500 MB chunks on your normal_q (or even entire SMRT cells...)
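As a rough check of the factor-of-6 argument, here is a small Python sketch. It assumes runtime grows linearly with chunk size, which is a simplification. The function name is hypothetical, and the 50 MB test chunk size is taken from the original question.

```python
def max_chunk_mb(test_chunk_mb, test_runtime_h, walltime_limit_h):
    """Largest chunk size that fits the wall-time limit.

    Assumes runtime scales linearly with chunk size (a simplification).
    """
    scale = walltime_limit_h / test_runtime_h
    return test_chunk_mb * scale


# 50 MB test chunk, 24h test runtime, 144h per-job limit
print(max_chunk_mb(50, 24, 144))  # 300.0
```

Linear scaling alone gives about 300 MB. The 500 MB guess above relies on the extra savings from larger chunks running a bit faster and from the lowered cutoff.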

Let me know how it goes.

Written 4.7 years ago by thackl, modified 4.7 years ago

Thomas,

I've been working with Jaime to set up his runs. I have some questions:

  1. If larger chunks are better/faster, should we max out the 3TB RAM (I can bump wall times for scaling runs)?
  2. If using 500 MB chunks as recommended, will the processes be able to utilize all 24 cores on a node? Should we spawn 2 processes on a node and have each use 12 cores, or 4 processes using 6 cores each? The nodes are set up in Cluster-On-Die mode, so there are 4 NUMA nodes per machine.
  3. Is there some good metric for scaling like it is best to have X threads per every 100 MB of data? Or is it data-dependent?

Thank you,
Brian Marshall - Computational Scientist - Virginia Tech ARC

Reply written 4.7 years ago by mimarsh2 (modified 9 months ago by RamRS)

Hi mimarsh2, sorry, somehow I missed the alert about your post. I would not go with chunks >500Mb. This has nothing to do with performance, but with the biology of the sample. You shouldn't have chunks close to or larger than the genome size.

You can use 30 or 40 threads per job. Scaling is (apart from a bit of single-core stuff) data-independent.

Reply written 4.7 years ago by thackl

jaimejr18 (Spain) wrote, 4.6 years ago:

Hi Thomas,

Sorry for the delay. I have finally corrected all the sequences; here is some information about the runs. Following your instructions, I changed the cutoff in config.cfg to 0.85 (files 1 to 10), but some of them still went on to a 4th or even 5th bwa-sr step, consuming a lot of time. Files 1 and 4 were tested at different coverages, and 50x was the best in terms of time and final corrected percentage. To save some time, I modified config.cfg to stop at 0.8 masked or less than 5% gained for files 11 to 40. I think those parameters are good for correcting this amount of data, but I hope to hear from you soon, maybe with some improvements, because those are only 1/4 of the total reads. The process takes a total of 27-30 days. Thanks!

Written 4.6 years ago by jaimejr18 (modified 9 months ago by RamRS)

Yeah, I think reducing to 80% masked / 5% newly masked makes sense for your data. I don't really have any further suggestions regarding improvements. Did you run the jobs on your "normal_q"? With 30 nodes, 3 days per chunk and 120 remaining chunks, it should take 10 days to complete correction, unless your queue is jammed...

Reply written 4.6 years ago by thackl

Yes, in the end I used the "normal_q" with 24 threads per job, because there I can run 6 jobs at the same time with enough wall time to complete. On the "largemem_q" I was not allowed enough wall time to finish a job. The total number of chunks was 48; at 6 jobs at a time that is 8 batches x 3 days = approximately 24 days in total. Some of them failed.

Reply written 4.6 years ago by jaimejr18 (modified 8 months ago by RamRS)
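The wall-time estimates in this thread all follow the same arithmetic: number of batches times time per chunk. A minimal Python sketch, assuming every chunk takes the same time and jobs run in full batches (the function name is hypothetical):

```python
import math


def total_days(n_chunks, days_per_chunk, concurrent_jobs=1):
    """Estimate total wall time in days when chunks run in fixed-size batches.

    Assumes every chunk takes the same time and a batch must finish
    before the next one starts.
    """
    batches = math.ceil(n_chunks / concurrent_jobs)
    return batches * days_per_chunk


# Original plan: 475 chunks, one at a time, 12 hours each
print(total_days(475, 12 / 24))              # 237.5

# Final setup: 48 chunks, 6 jobs at a time, 3 days each
print(total_days(48, 3, concurrent_jobs=6))  # 24
```

Failed jobs that need to be rerun, and time spent waiting in the queue, are not included in this estimate.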