Question: Correcting PacBio sequences with proovread using Illumina HiSeq reads
jaimejr18 (Spain) wrote, 3.8 years ago:

Hi

I want to correct PacBio sequences with Illumina HiSeq data using the proovread tool. I can use a 2x60-core machine with 3 TB of memory, or another with 16x24 cores and 512 GB.

The developers advise chunking the SMRT cell data into small files and using those to correct the PacBio data. I did, and now have 475 files of 50 Mb each. I ran proovread on one of the files with 50x coverage, and it took 12 hours on the large-memory machine. But to correct all the files I would have to run them one by one, which is going to take a very long time.

Does anyone use this tool? Could I run more than one PacBio file per run to reduce the total time?

Thanks 


I don't know about the tool you're using, but if you have those resources available, why not do the samples in parallel? What was the resource usage like when you tested it on one sample?

— andrew.j.skelton73, 3.8 years ago

Not sure; that's why I'm asking whether anyone has used this tool. I'm new to this kind of tool and I'm not sure how to run it.

I have restricted access to the machines, with time and core limits: if I use 60 cores I can only use them for 12 hours, and I can't run another job until this one finishes, so 12 hours x 475 files = 237 days...

— jaimejr18, 3.8 years ago

Crossposted to SeqAnswers http://seqanswers.com/forums/showthread.php?t=65379

— Daniel Swan, 3.8 years ago

Yes, it's the same question; I posted it there as well. Should I remove one of them?

— jaimejr18, 3.8 years ago
thackl (MIT) wrote, 3.8 years ago:

Hi Jaime,

I had a look at the log you sent me (parsed with the grep command from https://github.com/BioInf-Wuerzburg/proovread#log-and-statistics):

[Mon Nov 30 17:38:30 2015] Running mode: sr
[Mon Nov 30 17:56:08 2015] Running task bwa-sr-1
[Wed Dec  2 17:20:54 2015] Masked : 81.1%
[Wed Dec  2 17:20:54 2015] Running task bwa-sr-2
[Fri Dec  4 00:23:09 2015] Masked : 89.1%
[Fri Dec  4 00:23:09 2015] Running task bwa-sr-3
[Sat Dec  5 12:03:45 2015] Masked : 91.2%
[Sat Dec  5 12:03:46 2015] Running task bwa-sr-finish
[Sun Dec  6 05:54:53 2015] Masked : 89.2%
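
These lines can be pulled out of the full proovread log with a simple grep along these lines (a sketch only; the README linked above gives the exact command):

# extract the run-mode, task and masking lines from a proovread log
# (sketch; see the proovread README for the exact command)
grep -E 'Running (mode|task)|Masked' /path/to/proovread.log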

Proovread ran 3 correction iterations (which successively improve read quality, with high-quality corrected parts being "masked"), followed by the finishing correction, which is mostly for polishing.

Your stats look pretty good. Getting up to 81.1% in the first iteration is great: it means that more than 81% of your data gets corrected right away. The second iteration gets you to 89.1%, the third only to 91.2%. The default cutoff for proovread to stop iterating and start the polishing step is either 92% masked or less than 3% gained compared to the previous iteration. The 92% is quite ambitious, in particular for large genome projects. In your case, the 3rd iteration does not really gain you anything but still takes 12 h to run.

Therefore, my suggestion for your setup would be:

Don't aim for 92%, but rather for something like 85%. This will save a lot of time (only two iterations) without losing a noteworthy portion of your data. You can set this via a custom config. Put the following line in a file (my-proovread.cfg):

'mask-shortcut-frac' => 0.85

and call

proovread -c /path/to/my-proovread.cfg -l .. -s ..
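
For concreteness, here is a minimal sketch of those two steps on a single chunk; the chunk and Illumina file names below are placeholders standing in for the ".." above:

# 1) write the custom config that lowers the masking cutoff to 85%
cat > my-proovread.cfg <<'CFG'
'mask-shortcut-frac' => 0.85
CFG

# 2) correct one PacBio chunk with ~50x Illumina coverage
#    (pacbio_chunk_001.fa and illumina_50x.fq are placeholder names)
proovread -c my-proovread.cfg -l pacbio_chunk_001.fa -s illumina_50x.fq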

If you compare your results to runs with lower Illumina coverage (30x or 40x), you should aim for the same thing: >85% after the second iteration. If you can get that from 30x or 40x, the lower coverage would reduce runtime even further; if not, stick with 50x.
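
If you want to try the lower coverages, one way to make the 30x/40x short-read sets is to downsample the Illumina reads before passing them to -s. seqtk is not part of proovread, just one common option, and the 0.6 fraction is only an example you would replace with the ratio of target to current coverage:

# downsample paired Illumina reads to ~60% (e.g. 50x -> ~30x); the same seed (-s42)
# keeps the two mate files in sync
seqtk sample -s42 illumina_R1.fq 0.6 > illumina_30x_R1.fq
seqtk sample -s42 illumina_R2.fq 0.6 > illumina_30x_R2.fq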

As for chunk size, I would try to optimize for the maximum possible on your queue with respect to the runtime limit and memory. Your queue has 24-core nodes with sufficient memory and a 144 h per-job limit. Given a runtime of 24 h for your test runs, you should be able to increase the chunk size by at least a factor of 6 (144 h / 24 h). On top of that, larger chunks run a bit faster anyway, and you will save time with the lowered mask-shortcut-frac cutoff. So my guess is that you should be able to run at least 500 Mb chunks on your normal_q (or even entire SMRT cells...)
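
The simplest way to push all chunks through the queue is one job per chunk, submitted in a loop. A sketch, assuming a Slurm-style scheduler behind your normal_q (the directory layout and file names are placeholders; adapt the directives for your actual scheduler, and check proovread --help for a threads option):

# submit one proovread job per 500 Mb PacBio chunk (Slurm assumed; adapt for PBS/Torque etc.)
mkdir -p logs
for chunk in chunks/pacbio_chunk_*.fa; do
    name=$(basename "$chunk" .fa)
    sbatch --partition=normal_q --nodes=1 --ntasks=24 --time=144:00:00 \
           --job-name="$name" --output="logs/${name}.slurm.log" --wrap \
        "proovread -c my-proovread.cfg -l $chunk -s illumina_50x.fq"
done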

Let me know how it goes.


Thomas,

I've been working with Jaime to set up his runs. I have some questions:

1) If larger chunks are better/faster, should we max out the 3TB RAM (I can bump wall times for scaling runs)?

2) If using 500 Mb chunks as recommended, will the processes be able to utilize all 24 cores on a node? Should we spawn 2 processes on a node with 12 cores each, or 4 processes using 6 cores each? The nodes are set up in Cluster-On-Die mode, so there are 4 NUMA nodes per machine.

3) Is there some good metric for scaling, e.g. is it best to have X threads per 100 Mb of data? Or is it data-dependent?

Thank you,

Brian Marshall - Computational Scientist - Virginia Tech ARC

— mimarsh2, 3.8 years ago

Hi mimarsh2, sorry, somehow I missed the alert about your post. I would not go with chunks >500Mb. This has nothing to do with performance, but with the biology of the sample. You shouldn't have chunks close to or larger than the genome size.

You can use 30 or 40 threads per job - scaling is (apart from a bit of single-core stuff) data-independent.

— thackl, 3.7 years ago
jaimejr18 (Spain) wrote, 3.7 years ago:

Hi Thomas,

Sorry for the delay. I finally corrected all the sequences; here is some information about the runs. Following your instructions I changed the config file to 0.85 (files 1 to 10), but some of them still went to a 4th or even 5th bwa-sr iteration, consuming a lot of time. Files 1 and 4 were tested at different coverages, and 50x was the best in both runtime and final corrected %. To save time, for files 11 to 40 I changed the config to stop at 0.80 masked or less than 5% gained. I think those parameters are good for correcting this amount of data, but I hope to hear from you soon, maybe with some improvements, because these are only 1/4 of the total reads. The whole process will take about 27-30 days. Thanks!

(In the table below, " repeats the value from the row above, and a blank iteration column means that iteration was not run.)

                                                      bwa-sr masked (%)
File size (Mb)  Coverage  File number  Size *.fa  Size *.fq  1st  2nd  3rd  4th  5th  final  wall time (hh:mm:ss)
500 30x _01 198 390 70,5 84,8 88,5     85,5 78:42:00
" 50x _01 197 389 69,8 84,7 88,4     85,1 69:51:00
" 70x _01 197 388 68,8 84,4 88,2     84,8 63:33:00
" 30x _02 189 372 61,8 78,8 83,9 89,4   82,7 102:36:00
" 30x _03 186 365 58,8 77,0 82,5 88,2   81,4 101:14:00
" 20x _04 193 381 67,9 83,2 87,1     82,0 56:40:00
" 30x _04 195 386 70,0 83,9 87,6     84,8 78:22:00
" 50x _04 195 385 69,2 83,6 87,4     84,2 68:13:00
" 30x _05 183 360 59,2 76,2 81,8 87,6   81,0 100:35:00
" 30x _06 179 353 53,5 72,4 78,7 84,9 87,1 79,8 134:13:00
" 50x _07 187 368 59,8 77,6 82,8 88,5   81,3 90:28:00
" " _08 193 381 65,9 82,4 86,8     83,5 67:56:00
" " _09 190 374 62,3 80,3 85,3     81,7 66:50:00
" " _10 181 355 53,8 74,9 81,2 87,3   78,9 61:43:00
" " _11 189 373 70,7 83,9       82,4 45:10:00
" " _12 177 348 56,1 75,5 81,6     78,8 64:40:00
" " _13 178 350 53,3 73,0 79,5 85,8   78,9 86:31:00
" " _14 175 343 51,8 71,1 77,6 84,0   77,2 85:28:00
" " _15 184 362 63,9 81,0       79,4 44:44:00
" " _16 176 346 58,8 77,0 82,6     79,7 65:23:00
" " _17 178 349 53,0 73,0 79,7 86,0   79,0 86:50:00
" " _18 192 379 71,6 85,3       83,5 45:35:00
" " _19 190 375 76,8 86,6       85,1 46:07:00
" " _20 188 370 68,3 83,3       81,4 45:14:00
" " _21 175 345 56,1 75,3 81,4     78,5 64:19:00
" " _22 186 36 66,4 82,7       81,3 45:10:00
" " _23 176 346 55,6 74,5 80,6     77,8 64:45:00
" " _24 186 367 63,3 79,7 84,5     81,5 67:35:00
" " _25 188 371 71,1 83,9       82,3 46:14:00
" " _26 183 361 70,8 82,9       81,5 45:25:00
" " _27 176 347 56,9 74,8 80,6     77,9 64:51:00
" " _28 178 351 57,7 76,5 82,4     79,3 66:00:00
" " _29 175 344 56,2 74,7 80,7     77,9 64:20:00
" " _30 176 345 56,7 75,2 81,2     78,5 65:03:00
" " _31 174 342 55,1 74,2 80,4     77,6 63:40:00
" " _32 181 357 61,0 78,0 83,3     80,3 66:03:00
" " _33 184 363 71,0 83,7       82,1 45:23:00
" " _34 177 349 58,1 75,9 81,7     78,9 65:08:00
" " _35 174 342 54,6 73,7 80,0     77,3 64:31:00
" " _36 183 361 70,1 82,9       81,6 45:18:00
" " _37 182 358 64,0 79,5 84,3     81,3 66:41:00
" " _38 184 362 58,8 77,2 82,8     79,6 66:01:00
" " _39 188 371 72,2 84,4       82,8 45:31:00
" " _40 182 358 68,8 81,9       80,7 44:56:00

Yeah, I think reducing to 80% masked / 5% newly masked makes sense for your data. I don't really have any further suggestions regarding improvements. Did you run the jobs on your "normal_q"? With 30 nodes, 3 days per chunk and 120 remaining chunks, it should take about 10 days to complete the correction, unless your queue is jammed...

— thackl, 3.7 years ago

Yes, in the end I used the "normal_q" with 24 threads per job, because there I can run 6 jobs at the same time with enough wall time to finish. On the "largemem_q" I wasn't allowed enough wall time to finish a job. The total was 48 chunks / 6 jobs at a time = 8 rounds x 3 days = about 24 days in total; some of them failed.

— jaimejr18, 3.7 years ago