Question

Piping Output From Find To Parallel

2

11.2 years ago

tayebwajb ▴ 120

I have tried to pipe output from the find command into parallel but I can't seem to figure out why it is not working as I expected.

Basically, I have three types of files in a directory, i.e.:

foo_1_dup.bam
foo_1_dup.bam.bai
foo_1_dup_recal_report

I want to use find to get *dup* but exclude *bai and pipe the output into parallel such that foo_1_dup.bam will be {1} and foo_1_dup_recal_report will be {2}.

I have many files such that each foo_[1..N]_dup.bam has its corresponding foo_[1..N]_dup_recal_report.

Here is the code I have tried:

find /dir/ \( -name \*dup\* ! -name \*bai\* \) | parallel --dryrun -N2 -j2 -k -v --progress --joblog recalibration_joblog --retries 2 --noswap "java -Xmx4g -jar GenomeAnalysisTK.jar -nct 4 -I {1} -BQSR {2} -R hg19.fasta -T PrintReads -o {1.}_recal.bam"

It works fine for the first round but then starts swapping {2} for {1}.

I have used find and piped its output to parallel before, and it worked fine, but then I had only two types of files in the directory, so I didn't have to exclude one. Could you please tell me what I am doing wrong, or suggest a better way to do it? Thanks in advance!

parallel bash • 5.0k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.2 years ago by tayebwajb ▴ 120

0

Please write a shell script if you want to repeat this kind of work. Shell scripts are easier to work with than a complicated one-liner. I will add a model script for your analysis as an answer; please write a script in your own style after that.

ADD REPLY • link 11.2 years ago by aravind ramesh ▴ 540

score 5 · Answer 1 · 2013-05-15

First of all, let's simplify your example, removing all the options that are not related to the matter:

find /dir/ \( -name \*dup\* ! -name \*bai\* \) | parallel -N2 "java -jar GenomeAnalysisTK.jar -I {1} -BQSR {2} -o {1.}_recal.bam"

I think that the problem is that there is one "dup" file that does not have a corresponding "dup_recal_report", making parallel switch "{1}" with "{2}". With -N2, parallel simply takes the input two lines at a time, so one unpaired file shifts every pair after it. You should try to print the commands that are executed to a file, and manually inspect them.
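One way to check this hypothesis before running anything is to list every BAM that has no matching report. This is only a minimal sketch, assuming the foo_N_dup.bam / foo_N_dup_recal_report naming from the question; the helper name missing_reports and the scratch directory are made up for illustration.

```shell
# Hypothetical helper: print every *_dup.bam under $1 that lacks its *_dup_recal_report.
missing_reports() {
    find "$1" -type f -name '*_dup.bam' | sort | while IFS= read -r bam; do
        report="${bam%.bam}_recal_report"
        [ -e "$report" ] || printf '%s\n' "$bam"
    done
}

# Demo on a scratch directory (a stand-in for /dir/): foo_2 has no report.
demo=$(mktemp -d)
touch "$demo/foo_1_dup.bam" "$demo/foo_1_dup.bam.bai" "$demo/foo_1_dup_recal_report" \
      "$demo/foo_2_dup.bam" "$demo/foo_2_dup.bam.bai"
missing=$(missing_reports "$demo")
echo "$missing"
```

If the list comes back empty, the order of find's output is probably the next thing to check: find does not sort its results, so a BAM and its report are not guaranteed to arrive on consecutive lines. Putting sort between find and parallel may be enough in that case.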

Moreover, to remove the "bai" files, I would use grep:

find /dir/ -iname "*dup*" | grep -v "bai"
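Note that an unanchored grep -v "bai" drops every path that contains "bai" anywhere, including in a directory name. A small sketch of the difference, using invented sample paths (the "baikal" directory is hypothetical and only there to show the over-matching):

```shell
# Sample paths; the "baikal" directory is invented for illustration.
paths='/dir/foo_1_dup.bam
/dir/foo_1_dup.bam.bai
/dir/baikal/foo_2_dup.bam'

# Unanchored: also drops the BAM that lives under /dir/baikal/.
loose=$(printf '%s\n' "$paths" | grep -v 'bai')

# Anchored: drops only names that end in ".bai".
strict=$(printf '%s\n' "$paths" | grep -v '\.bai$')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

For the file names in the question both patterns give the same result, so the anchored form is just a safer default.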

Finally, if you know that for each "dup" file there is always a "dup_recal_report" file, I would apply find only to the dup files, and then use {.} to reconstruct the name of foo_1_dup_recal_report:

find /dir/ -iname "*dup*" | grep -v "bai" | grep -v "report" | parallel "echo java -jar GenomeAnalysisTK.jar -I {} -BQSR {.}_recal_report -o {.}_recal.bam"
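To see what {} and {.} expand to for a given input line, the same command can be built in plain shell. This is only an illustration: build_cmd is a hypothetical helper, and ${f%.*} is used here as the shell counterpart of {.}, which removes the last extension from the input.

```shell
# Build the command for one input path the way {} and {.} would expand.
build_cmd() {
    f=$1
    stem=${f%.*}    # same idea as {.}: drop the last extension
    printf 'java -jar GenomeAnalysisTK.jar -I %s -BQSR %s_recal_report -o %s_recal.bam\n' \
        "$f" "$stem" "$stem"
}

cmd=$(build_cmd /dir/foo_1_dup.bam)
echo "$cmd"
```

Because every job derives its report name from its own BAM, a missing report can no longer shift the pairing of the other files.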

score 0 · Answer 2 · 2013-05-15

0

11.2 years ago

aravind ramesh ▴ 540

Model script for solving your problem

Step 1: Make a list containing only the type {1} files: ls -1 "whatever pattern" > list_1

Script:

while IFS= read -r file_1
do
    # Derive the corresponding {2} file name, e.g. foo_1_dup.bam -> foo_1_dup_recal_report
    file_2=$(printf '%s\n' "$file_1" | sed 's/\.bam$/_recal_report/')
    whatever_program "$file_1" "$file_2"   # plus the respective command-line arguments
done < list_1

Your script rocks. This might not be the best approach, but it is what I often use. If you don't follow the script, please comment and I will try to answer your query.

ADD COMMENT • link 11.2 years ago by aravind ramesh ▴ 540

1

this is not parallel

ADD REPLY • link 11.2 years ago by Jeremy Leipzig 22k

0

He can add parallel after finding the files, in the last line of the script.

ADD REPLY • link 11.2 years ago by aravind ramesh ▴ 540

0

I would really prefer to use parallel because I find it amazing at load-balancing!!

ADD REPLY • link 11.2 years ago by tayebwajb ▴ 120

0

You asked for a better way to do it, and I've given my answer. You can use parallel or any other program: replace the last line of the model script with parallel and its arguments, including the files.

ADD REPLY • link 11.2 years ago by aravind ramesh ▴ 540

1

I think you are missing the point. Yes, you could certainly add a & to background each iteration, but that effectively cripples the usage of GNU parallel. http://www.gnu.org/software/parallel/

ADD REPLY • link 11.2 years ago by Jeremy Leipzig 22k
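The two approaches discussed in this thread can be combined: let a loop do only the pairing, and hand the resulting list to GNU parallel so that it does the scheduling. A rough sketch, assuming the naming scheme from the question; make_pairs and the temporary list file are hypothetical, and the parallel call only runs if GNU parallel is actually installed.

```shell
# Hypothetical helper: write one "bam<TAB>report" line per BAM listed in file $1.
make_pairs() {
    while IFS= read -r bam; do
        printf '%s\t%s\n' "$bam" "${bam%.bam}_recal_report"
    done < "$1"
}

list=$(mktemp)                                        # stand-in for list_1
printf '%s\n' foo_1_dup.bam foo_2_dup.bam > "$list"   # demo input
pairs=$(make_pairs "$list")
echo "$pairs"

# Each line becomes one job; --colsep splits it into {1} and {2}.
# --dry-run only prints the commands instead of running them.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    printf '%s\n' "$pairs" | parallel --dry-run --colsep '\t' \
        "java -jar GenomeAnalysisTK.jar -I {1} -BQSR {2} -o {1.}_recal.bam"
fi
```

Unlike backgrounding each iteration with &, this keeps parallel in charge of how many jobs run at once, along with options such as --joblog and --retries from the original command.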