Question: popoolation parallelization problem
dovi wrote, 23 months ago:

Hi all,

I am using popoolation for my pool-seq data to calculate Fst values with the fst-sliding.pl script. As the synchronized file weighs ~2 TB, I thought it would be better to parallelize the process. I've therefore split the synchronized file into small chunks, and I run multiple jobs (max of 50) at the same time on a cluster. Each job should take ~1 day. However, when I start running multiple jobs, I see that about ~20 output files start being filled (the popoolation output .fst files grow in size) but the others remain empty until the first ones have finished. I know that all of them actually started, because I can see the .params file for all of them. I have no clue why this is happening; I thought maybe it's because all jobs call the same perl script, and somehow it can't handle so many calls? Has anyone experienced this before, or have any clue how to solve it?

Thanks!

modified 23 months ago • written 23 months ago by dovi

For those wondering (like I did): popoolation is an NGS data analysis pipeline for comparing population allele frequencies.

written 23 months ago by genomax

Have you tried looking at the temporary files that popoolation generates, or better yet at the code, to see how it generates them?

written 23 months ago by Vincent Laufer

It doesn't seem to generate a temporary file; from what I see in the code, it prints the result directly into the output file.

written 23 months ago by dovi

Hmm. If the problem isn't with file handling in their scripts, my next most likely guesses are:

1) system resource requests on your cluster
2) file handling in your scripts
3) system resource allocation on your cluster

written 23 months ago by Vincent Laufer

I thought maybe it's because all jobs call the same perl script, and somehow it can't handle so many calls?

Sounds unlikely. But perhaps the load on the server got too big?

written 23 months ago by WouterDeCoster

The cluster uses the MOAB load manager, which schedules and manages all jobs, so my understanding is that if a job is running, it's because it isn't overloading the server and can run properly (otherwise it stays as 'waiting'). Or is that not what you mean?

written 23 months ago by dovi

I'm not sure. I'm not familiar with popoolation or MOAB.

written 23 months ago by WouterDeCoster

I know nothing about popoolation, but are you sure you are able to do this sort of brute-force parallelization? You may want to ask the program authors just to be sure.

written 23 months ago by genomax

Just following up on this thread: have you figured it out? I'm looking to do something similar. Can you explain how you implemented the parallelisation? Did you break up the sync file into chunks and send individual jobs?

written 22 months ago by hern.moral

Well, I think my problem was related to the cluster I was using: sometimes jobs ran as expected (time-wise) and sometimes they were 'delayed', and I had to run them again (as they exceeded the 3-day time limit on my cluster). I did the parallelisation as you say: I split the sync file into chunks and then sent several individual jobs.

modified 22 months ago • written 22 months ago by dovi
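Editor's note: the chunk-and-submit approach described above can be sketched as a short shell script. Everything here is illustrative and hedged: the file names, the chunk size, the fst-sliding.pl options, and the qsub resource syntax (which assumes a PBS/Torque-style scheduler; MOAB clusters commonly accept qsub/msub with similar flags) are assumptions, not the poster's actual commands. The script runs in dry-run mode with a tiny stand-in input so the splitting logic is visible.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of splitting a sync file and submitting one job per
# chunk. All names and options are illustrative; check the PoPoolation
# documentation for the options your analysis actually needs.
set -euo pipefail

SYNC=pooled.sync     # the big synchronized file (made-up name)
CHUNK_LINES=4        # tiny for this demo; use something like 500000 in practice
DRY_RUN=1            # 1 = just print the submissions; 0 = really qsub them

# Demo input: a tiny stand-in sync file if the real one is absent.
[ -f "$SYNC" ] || seq 1 10 > "$SYNC"

# Split by whole lines (never by bytes), so no sync record is cut in half.
# -d gives numeric suffixes: chunk_00, chunk_01, ...
split -l "$CHUNK_LINES" -d "$SYNC" chunk_

for chunk in chunk_*; do
    cmd="perl fst-sliding.pl --input $chunk --output $chunk.fst \
--window-size 500 --step-size 500 --pool-size 50"
    if [ "$DRY_RUN" = 1 ]; then
        echo "would submit: $cmd"
    else
        # PBS/Torque-style submission; request less walltime than the
        # cluster's 3-day limit so jobs aren't killed at the boundary.
        echo "$cmd" | qsub -l walltime=48:00:00 -N "fst_$chunk"
    fi
done
```

One caveat with this kind of brute-force split: sliding windows that straddle a chunk boundary would be computed from partial data, so it is safer to cut the sync file at positions aligned with window boundaries, or per chromosome/scaffold.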
Powered by Biostar version 2.3.0