Parallel for shell script with different output
7.9 years ago
Korsocius ▴ 200

Dear all,

I need help with the parallel command. I have a shell script that I would like to run 12 times at the same time, but each run needs a different output name. The output is a .tsv file with the same name as the input. Could you help me do that?

Thanks a lot

Please post your standard shell command line.

The parallel command would be something like

parallel scriptname <hardcoded options for the script> {1}.input {1}.output ::: prefixes

Show an example of the input and the output.

I have a script named bin.sh whose input is 1.bam and whose output will be 1.tsv. Each input is in its own folder, and the folders are named like the files, from 1 to 12. Each folder contains 1.bam (input) and 1.bai (input), which the shell script reads. The output will be 1.tsv, and so on for each BAM file.

parallel myscript {} {.}.bai '>' {.}.tsv ::: */*.bam
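
To see what that line expands to: {} is the input path and {.} is the input path without its extension. Here is a minimal plain-bash sketch of the same name mapping. The three paths are made-up examples of the folder layout described above, and strip_ext is a helper name I invented for illustration; nothing is actually run, the loop only prints the command lines.

```shell
#!/usr/bin/env bash
# Mimic GNU Parallel's {} and {.} replacement strings with plain shell.
# strip_ext is a hypothetical helper; the paths below are example inputs.
strip_ext() {
    # Remove the last extension, as {.} does: 1/1.bam -> 1/1
    printf '%s\n' "${1%.*}"
}

for bam in 1/1.bam 2/2.bam 12/12.bam; do
    base=$(strip_ext "$bam")
    # Print the command line that would be built for this input.
    echo "myscript $bam $base.bai > $base.tsv"
done
```

The first line printed is `myscript 1/1.bam 1/1.bai > 1/1.tsv`, so every .tsv lands next to its own .bam.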

7.9 years ago

an example with echo:

seq 1 12 | parallel echo "Hello" '>' 'result.{}'

Also look into --results for a structured way of organizing the output files.

7.9 years ago
for input in *.bam; do out=$(echo "$input" | awk -F"." '{ print $1 }'); bin.sh "$input" "$out.tsv" & done


To understand:

for input in *.bam  # for each bam file
do
out=$(echo "$input" | awk -F"." '{ print $1 }')  # get the unique output prefix (text before the first dot)
bin.sh "$input" "$out.tsv" &  # run the script and push it to the background
done

won't work if there are not enough cores.

Yes. But the for loop helps me a lot in other cases where the operation is computationally not expensive.

... and if something goes wrong, you'll have to (quickly) find & kill your PIDs ...

I used to do that.
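
For anyone staying with the loop, both problems above (more jobs than cores, and hunting down PIDs) can be reduced a bit. This is only a rough sketch under my own assumptions: process_one is a stand-in for bin.sh, run_throttled and max_jobs are names I made up, and the batching is crude because it waits for a whole batch to finish before starting the next one.

```shell
#!/usr/bin/env bash
# Sketch: run at most max_jobs background jobs at a time and kill them on Ctrl-C.
# process_one is a hypothetical stand-in for bin.sh <input> <output>.
max_jobs=4
pids=""

process_one() {
    # Write a dummy result to the output file.
    echo "processed $1" > "$2"
}

run_throttled() {
    running=0
    pids=""
    # If interrupted, kill the jobs of the current batch.
    trap 'kill $pids 2>/dev/null' INT TERM
    for input in "$@"; do
        process_one "$input" "${input%.*}.tsv" &
        pids="$pids $!"              # remember the PID of the job just started
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                     # let the current batch finish
            running=0
            pids=""
        fi
    done
    wait                             # wait for the last, partial batch
    trap - INT TERM
}
```

Called as `run_throttled */*.bam`, it writes one .tsv next to each .bam. GNU Parallel's -j option (for example `parallel -j 4 ...`) gives the same cap on simultaneous jobs without the bookkeeping.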

I am trying to understand why people still use for loops for independent jobs.

Is it readability? Is the for-loop really easier to read than 'parallel bin.sh {} {.}.tsv ::: *.bam'? Or if the jobs were bigger/more complex using a function:

myfunc() {
bin.sh "$1" "$2"
#more stuff here
}
export -f myfunc
parallel myfunc {} {.}.tsv ::: *.bam


For computationally cheap jobs I really do not see the benefit of a for loop.

The only advantage I can think of is that GNU Parallel may not be installed. But that advantage can vanish in just 10 seconds: wget -O - pi.dk/3|bash

@Geek_y can you enlighten me, what you see as the advantage?

I am used to using for loops, but I will definitely shift towards parallel. I have started reading your tutorial on parallel. I come from a biology background and am something of a beginner in core bioinformatics, so I need some time to learn best practices.

I wonder why one would dispatch computationally cheap operations to a bg core in the first place. The gain in execution time will surely be balanced out by the time taken to write the loop + dispatch to different cores.

The input files are in different folders, so you might want to use find (with an optional -maxdepth) to locate these files first, then run the script on them.
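
A minimal sketch of the find approach, assuming the BAM files sit exactly one folder below the top directory. list_jobs is a name I made up for illustration, and -maxdepth is a GNU/BSD find option rather than strict POSIX. It only prints each input with the output name derived from it, separated by a tab.

```shell
#!/usr/bin/env bash
# Sketch: locate BAM files at most one folder below a top directory and
# derive each output name from the input name.
# list_jobs is a hypothetical helper; the layout is an assumption.
list_jobs() {
    # -maxdepth 2: top directory (depth 0), its subfolders (1), files in them (2).
    find "$1" -maxdepth 2 -type f -name '*.bam' -exec sh -c '
        for bam; do
            # Replace the .bam suffix with .tsv to get the output name.
            printf "%s\t%s\n" "$bam" "${bam%.bam}.tsv"
        done
    ' sh {} +
}
```

`list_jobs .` prints one input/output pair per line. To actually run the script, the same find output can be piped into parallel, which reads its arguments from standard input: `find . -maxdepth 2 -name '*.bam' | parallel bin.sh {} {.}.tsv`.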