Question: How To Parallelize Codeml With Python
Asked by Biojl (Barcelona), 7.2 years ago:

Hi,

I was wondering if someone has already succeeded in running PAML (dN/dS) / branch-site tests on multicore computers. I normally use http://www.parallelpython.com/

The problem is that the codeml program creates some temporary files in the folder where it is located, so I can't run more than one instance at a time. Would it be possible to change the working path each time we run a codeml copy?

Any help will be welcome :)

Tags: python, parallel, paml, codeml
Answer by Lars Arvestad, 7.2 years ago:

The PAML code is very convoluted, so don't expect to be able to go in and fix the program to Do the Right Thing.

I would write a small wrapper around codeml that creates a temp directory in which codeml is run. That way, you should be able to run as many codeml instances as you want in parallel.
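For illustration, a minimal sketch of such a wrapper (not code from this thread): it assumes codeml is on your PATH and that the paths inside each .ctl file are absolute, or that you copy the input files into the temp directory as well.

    # A minimal sketch of the temp-directory wrapper idea; paths and the .ctl
    # layout are assumptions, not from the thread.
    import os
    import shutil
    import subprocess
    import tempfile

    def run_codeml(ctl_file, result_dir):
        """Run one codeml job in its own temporary directory so that parallel
        runs don't overwrite each other's scratch files (rst, rst1, rub, lnf)."""
        job = os.path.basename(ctl_file)
        tmpdir = tempfile.mkdtemp(prefix="codeml_")
        try:
            # The seqfile/treefile paths inside the .ctl must be absolute,
            # or those files must be copied into tmpdir as well.
            shutil.copy(ctl_file, tmpdir)
            subprocess.check_call(["codeml", job], cwd=tmpdir)
            # Copy the outputs back, prefixed with the job name so they
            # don't overwrite each other in result_dir.
            for fname in os.listdir(tmpdir):
                if fname != job:
                    shutil.copy(os.path.join(tmpdir, fname),
                                os.path.join(result_dir, job + "." + fname))
        finally:
            shutil.rmtree(tmpdir)

Each call is independent, so the wrapper can be dispatched from parallelpython, multiprocessing, or even a simple shell loop.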


Basically that's what I'm asking in the question...

— Biojl, 7.2 years ago

Yes, a python or shell script wrapper would be useful in this instance. I hope that either these jobs are small or you're running these multiple instances on a cluster/GRID system.

— Liam Thompson, 7.2 years ago
Answer by Steve Moss (United Kingdom), 7.1 years ago:

Check out this article :-)

gcodeml: A Grid-enabled Tool for Detecting Positive Selection in Biological Evolution


Thanks! Great resource

— Biojl, 6.5 years ago
Answer by mgalactus (United Kingdom), 6.6 years ago:

I've successfully used the Python multiprocessing module.

You can find a working example here: https://gist.github.com/3743820
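Not the gist itself, but a rough sketch of the general multiprocessing pattern it relies on (the glob pattern, worker count, and per-job directory naming are assumptions):

    # A rough sketch of the multiprocessing approach; not the code from the gist.
    # Each worker runs codeml in its own subdirectory so the fixed-name scratch
    # files (rst, rst1, rub, lnf) from concurrent runs don't collide.
    import glob
    import multiprocessing
    import os
    import shutil
    import subprocess

    def run_one(ctl_path):
        workdir = ctl_path + ".d"           # hypothetical per-job directory
        os.makedirs(workdir, exist_ok=True)
        shutil.copy(ctl_path, workdir)      # the .ctl should use absolute input paths
        subprocess.call(["codeml", os.path.basename(ctl_path)], cwd=workdir)
        return ctl_path

    if __name__ == "__main__":
        ctl_files = sorted(glob.glob("tmpcodeml/*.ctl"))
        with multiprocessing.Pool(processes=4) as pool:
            for done in pool.imap_unordered(run_one, ctl_files):
                print("finished", done)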


Do you have a clear example of how it works? I can't figure it out from the code, and the example isn't working for me either.

— Biojl, 6.5 years ago

Yeah, you are right, the script lacks some explanations:

  • First of all, you need to prepare a directory with all of your alignments in PHYLIP format (named something like GENE_ID.phy).
  • Then create a directory called "tmpcodeml" (alongside the previous one) containing one file per alignment, named like GENE_ID.phy.ctl: these are the codeml parameter files (you can generate them easily with Biopython; see the sketch below).
  • Launch the script as follows: python parallelPAML.py -a FIRST_DIRECTORY -t TREE.nwk -r 7 (the -r option sets how many CPUs to use).
  • At the end, your tmpcodeml directory will contain the output files (.out, .rst and .rst1).

Sorry if it looks a bit convoluted, but it was written in a rush :)
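For the Biopython step in the second bullet, here is a rough sketch (assuming Bio.Phylo.PAML.codeml and the directory layout described above; the directory names and codeml options are placeholders, not the ones from the gist):

    # A rough sketch of generating one GENE_ID.phy.ctl per alignment with
    # Biopython; directory names and codeml options are placeholders.
    import glob
    import os

    from Bio.Phylo.PAML import codeml

    ALN_DIR = "alignments"   # the directory holding the GENE_ID.phy files
    CTL_DIR = "tmpcodeml"    # where the .ctl files (and later the outputs) go
    TREE = "TREE.nwk"

    os.makedirs(CTL_DIR, exist_ok=True)

    for phy in glob.glob(os.path.join(ALN_DIR, "*.phy")):
        name = os.path.basename(phy)                      # e.g. GENE_ID.phy
        cml = codeml.Codeml(alignment=os.path.abspath(phy),
                            tree=os.path.abspath(TREE),
                            out_file=name + ".out",
                            working_dir=CTL_DIR)
        # Example parameters only; set model/NSsites to match the test you run.
        cml.set_options(seqtype=1, model=0, NSsites=[0], fix_omega=0, omega=0.4)
        cml.ctl_file = os.path.join(CTL_DIR, name + ".ctl")
        cml.write_ctl_file()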

— mgalactus, 6.5 years ago

This is really helpful. Just a quick question: do I have to change the NSsites value at line 36 of the script to run the M7 and M8 models?

Thanks

— baikalevi, 5.2 years ago
Answer by Liam Thompson (Gothenburg, Sweden), 7.2 years ago:

I do not know of any way to run parallel PAML computations on one dataset. I spent a fair amount of time looking into this and was assured by computational experts that I would need to reprogram the C(?)-based PAML to add multi-processor or parallel computation support. I don't think that is going to happen anytime soon.

I managed by reducing the size of the datasets and submitting the jobs to the queue system of our university cluster. That way, jobs were automatically started once resources were free, although the analysis of the individual jobs remained slow.


With parallelpython it is possible to parallelize any command-line task, in the sense of automatically splitting the jobs without having to make several datasets. I have already done that with alignment programs (mafft, prank...). So it's not truly a parallel PAML that I'm after, but rather the ability to run several instances of PAML on different processors. I hope that's clearer now.
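For reference, a minimal sketch of that pattern with the pp module (assuming Parallel Python's classic Server/submit interface; run_codeml is a hypothetical per-temp-directory wrapper like the one sketched in the answers above, and the .ctl file names are placeholders):

    # A minimal sketch using Parallel Python (pp); the wrapper body is the same
    # per-temp-directory idea as above, with hypothetical file names.
    import pp

    def run_codeml(ctl_file):
        # Imports live inside the function so pp can ship it to worker processes.
        import os, shutil, subprocess, tempfile
        tmpdir = tempfile.mkdtemp(prefix="codeml_")
        shutil.copy(ctl_file, tmpdir)
        subprocess.call(["codeml", os.path.basename(ctl_file)], cwd=tmpdir)
        return tmpdir

    job_server = pp.Server(ncpus=4)
    jobs = [job_server.submit(run_codeml, (ctl,))
            for ctl in ("gene1.ctl", "gene2.ctl")]   # hypothetical control files
    for job in jobs:
        print(job())   # job() blocks until that codeml run has finished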

— Biojl, 7.2 years ago