Calling Python Script From Bash Console While Using GNU Parallel
10.8 years ago
NPalopoli ▴ 290

I am trying to run a Python script that processes many FASTA files in parallel using GNU parallel 20110722. As shown below, none of the invocations I tried works. (^C marks the point where I interrupted the job with Ctrl+C because the system did not respond.)

me@kubuntu:~/Programs/LeitMotifsParallel$ parallel python {1} :::: <(echo MainMult.18136.py)
^C
me@kubuntu:~/Programs/LeitMotifsParallel$ parallel python {1} ::: <(echo MainMult.18136.py)
File "/dev/fd/63", line 1
MainMult.18136.py
^
SyntaxError: invalid syntax
me@kubuntu:~/Programs/LeitMotifsParallel$ parallel MainMult.18136.py
parallel: Input is tty. Press CTRL-D to exit.
^C
me@kubuntu:~/Programs/LeitMotifsParallel$


However, the Python script runs as expected when run directly from the console.

me@kubuntu:~/Programs/LeitMotifsParallel$ MainMult.18136.py
/home/me/Programs/LeitMotifsParallel/StAlg.py:6: DeprecationWarning: the sets module is deprecated
import sets #@UnusedImport
Start : 18:59:34 11Aug2011
['M']
['M']
(...)


Although I have watched the two GNU Parallel tutorials on YouTube and gone through the many examples in the README file, I have not been successful in finding an answer for this situation, so I would really appreciate it if you could help me solve this.

python parallel bash • 11k views

Note that you can also use threading really easily from within Python: http://docs.python.org/library/multiprocessing.html
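
As a rough illustration of this suggestion, here is a minimal sketch that uses the linked multiprocessing module to process several files at once. The function names count_records and process_all are hypothetical, and counting FASTA records is only a stand-in for whatever the real script computes.

```python
import multiprocessing
import sys


def count_records(path):
    """Return (path, number of FASTA records), counting lines that start with '>'."""
    with open(path) as handle:
        return path, sum(1 for line in handle if line.startswith(">"))


def process_all(paths, workers=2):
    """Apply count_records to every path using a pool of worker processes."""
    if not paths:
        return []
    with multiprocessing.Pool(processes=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return pool.map(count_records, paths)


if __name__ == "__main__":
    # Hypothetical usage: python count_fasta.py foo.fasta bar.fasta
    for path, n_records in process_all(sys.argv[1:]):
        print(path, n_records)
```

The work is split per file here, so each worker process handles one file at a time.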


Not sure this is a bioinformatics question. What are the ":::" doing? Did you try "parallel python -- your-prog.py"?


Is it possible/necessary to put the python call into a separate bash file (e.g. run.sh)? It might be a problem with the input/output redirection or piping...


Thanks all for the suggestions. Though I don't think it can be strictly categorized as a bioinformatics question, I assumed it might be relevant for the community.
brentp: The ":::" are used by GNU Parallel for specifying arguments on the command line. I tried your suggestion, but the "Input is tty" line shows that it does not help.
cjt: I tried calling Python from a separate bash file and using parallel to call that file, but that does not work either.
Michael Schubert: I would definitely use threading from within Python in other cases, but for this job I need to use parallel.

10.8 years ago
tange ▴ 190

First, it is good to see a fellow bioinformatician using GNU Parallel.

Secondly it is good to learn you have watched the 4 intro videos: http://www.youtube.com/watch?v=OpaiGYxkSuQ http://www.youtube.com/watch?v=P40akGWJ_gY http://www.youtube.com/watch?v=1ntxT-47VPA http://www.youtube.com/watch?v=fOX1EyHkQwc and that you have browsed through the examples: http://nd.gd/u3

GNU Parallel does not parallelize the internals of an existing program. What it can do, however, is call the same program with different arguments. So if you normally do:

MainMult.18136.py foo.fasta
MainMult.18136.py bar.fasta


you can parallelize this by running two instances of MainMult.18136.py in parallel, like this:

parallel MainMult.18136.py ::: foo.fasta bar.fasta


If the filenames are hardcoded in your program, then GNU Parallel cannot parallelize the task for you. The filenames must be given on the command line.

Alternatively, if the program reads from standard input (stdin) and you normally do:

cat foo.fasta bar.fasta | MainMult.18136.py


then you can parallelize this by using GNU Parallel --pipe:

cat foo.fasta bar.fasta | parallel --pipe --block 10M --recstart '>' MainMult.18136.py


This will chop the input into 10 MB sized chunks and pass them on to MainMult.18136.py on standard input (stdin). Each chunk will be chopped at a '>' which is where a FASTA record starts.
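
For --pipe to be useful, the script has to read its FASTA records from standard input. A minimal sketch of that reading step is shown below; the names read_fasta and summarize are hypothetical, and printing sequence lengths is only a placeholder for the real analysis.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A '>' line starts a new record, so emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def summarize(stream):
    """Print each record's header and sequence length (placeholder analysis)."""
    for header, sequence in read_fasta(stream):
        print(header, len(sequence))
```

A script would call summarize(sys.stdin) after importing sys. Because each chunk begins at a '>' line, every process receives only whole records.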

If what you normally do is:

MainMult.18136.py
MainMult.18137.py
MainMult.18138.py


that is, run a bunch of different programs that happen to have the arguments hardcoded in each of them, then you can do:

parallel ::: MainMult.*.py


or:

ls | grep MainMult | parallel
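
For comparison, the same idea (start every matching script as its own process, several at a time) can be sketched in Python. This is not how GNU Parallel is implemented; the function names and the max_jobs parameter are hypothetical, and the sketch runs each script through the current Python interpreter rather than executing it directly.

```python
import glob
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_script(path):
    """Run one Python script in its own process and return (path, exit code)."""
    completed = subprocess.run([sys.executable, path])
    return path, completed.returncode


def run_scripts_in_parallel(pattern, max_jobs=2):
    """Run every script matching the glob pattern, at most max_jobs at a time."""
    scripts = sorted(glob.glob(pattern))
    # Threads only wait on the child processes; the real work runs in the children.
    with ThreadPoolExecutor(max_workers=max_jobs) as executor:
        return list(executor.map(run_script, scripts))


if __name__ == "__main__":
    for path, code in run_scripts_in_parallel("MainMult.*.py"):
        print(path, "exit code", code)
```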


However, from your description of the Python script it seems you normally do:

MainMult.18136.py


(only one program with no arguments and no reading from standard input). In this situation GNU Parallel is unable to parallelize the task. My advice is to change the Python script so that it takes the filenames as its arguments from the command line.
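
A minimal sketch of that change, using argparse from the standard library, might look like the following. The run function here is a hypothetical placeholder for the script's real analysis, and its single-argument signature is an assumption.

```python
import argparse


def run(fasta_path):
    """Placeholder for the real analysis; it only reports the file it was given."""
    return f"processing {fasta_path}"


def parse_args(argv=None):
    """Read the FASTA filenames from the command line instead of hardcoding them."""
    parser = argparse.ArgumentParser(description="Process FASTA files.")
    parser.add_argument("fasta_files", nargs="*", help="FASTA files to process")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for path in args.fasta_files:
        print(run(path))


if __name__ == "__main__":
    main()
```

With the filenames taken from the command line, one copy of the script could be started per file in the same way as the "parallel MainMult.18136.py ::: foo.fasta bar.fasta" example above.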


I have seen the files and read the examples but I couldn't make my script work.

The FASTA files are referenced directly in the MainMult.18136.py file, in the following way:
1) A variable is defined pointing to the file: work_file = "/home/me/Prog/Seqs.fasta"
2) The filename is specified as one of many parameters in the relevant function call: Run("Seqs.fasta", 7, 100, True, 0.63, True, False, 0.98, False, False)

The point here is that MainMult.18136.py runs from the command line, but not when called with parallel.


@NPalopoli: Not sure I understand; are you iterating over lists of FASTA files and function arguments within the script itself?


The key is that (and I quote you), "GNU Parallel does not parallelize the internals of an existing program". I made a separate Python script for each FASTA file and managed to run the program by following your advice and calling: parallel ::: MainMult.*.py. Thanks a lot for your help!