Piping Output From Find To Parallel
2
9.2 years ago
tayebwajb ▴ 110

I have tried to pipe output from the find command into parallel but I can't seem to figure out why it is not working as I expected.

Basically, I have three types of files in a directory:

• foo_1_dup.bam
• foo_1_dup.bam.bai
• foo_1_dup_recal_report

I want to use find to get the files matching *dup* but exclude *bai, and pipe the output into parallel such that foo_1_dup.bam will be {1} and foo_1_dup_recal_report will be {2}.

I have many files such that each foo_[1..N]_dup.bam has its corresponding foo_[1..N]_dup_recal_report.

Here is the code I have tried:

find /dir/ \( -name \*dup\* ! -name \*bai\* \) | parallel --dryrun -N2 -j2 -k -v --progress --joblog recalibration_joblog --retries 2 --noswap "java -Xmx4g -jar GenomeAnalysisTK.jar -nct 4 -I {1} -BQSR {2} -R hg19.fasta -T PrintReads -o {1.}_recal.bam"


It works fine for the first round but then starts swapping {2} for {1}.

I have used find and piped output to parallel before and it worked fine but then I had only two types of files in the directory so I didn't have to exclude one. Could you please tell me what I am doing wrong or a better way to do it? Thanks in advance!

parallel bash bioinformatics • 3.8k views
0

Please write a shell script if you want to repeat this kind of work. Shell scripts are easier to work with than a complicated one-liner. I will add a model script for your analysis as an answer; please write a script in your own style after that.

5
9.2 years ago

First of all, let's simplify your example, removing all the options that are not related to the matter:

find /dir/ \( -name \*dup\* ! -name \*bai\* \) | parallel -N2 "java -jar GenomeAnalysisTK.jar -I {1} -BQSR {2} -o {1.}_recal.bam"


I think the problem is that there is one "dup" file that does not have a corresponding "dup_recal_report", which makes parallel switch "{1}" with "{2}": with -N2 the inputs are consumed two at a time, so a single unpaired file shifts every pair after it. You should try printing the commands that are executed to a file and inspecting them manually.
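To make the shift visible, here is a minimal sketch that pairs file names two at a time. The file names are invented for the demo, and `xargs -n 2` is only a stand-in for `parallel -N2` so that it runs without GNU parallel installed. The list is sorted first because find does not promise any particular output order.

```shell
#!/bin/sh
# Sketch: one unpaired file shifts every later pair when inputs are
# consumed two at a time. All file names here are made up for the demo.
set -eu

demo_dir=$(mktemp -d)
touch "$demo_dir/foo_1_dup.bam" "$demo_dir/foo_1_dup_recal_report" \
      "$demo_dir/foo_2_dup.bam" \
      "$demo_dir/foo_3_dup.bam" "$demo_dir/foo_3_dup_recal_report"

# Strip the directory part, sort in the C locale so "." sorts before "_",
# then group the names two per line (stand-in for parallel -N2).
pairs=$(find "$demo_dir" -name '*dup*' ! -name '*bai*' \
        | sed 's|.*/||' \
        | LC_ALL=C sort \
        | xargs -n 2 echo)

printf '%s\n' "$pairs"
rm -rf "$demo_dir"
```

The first line pairs foo_1 correctly, but the second line contains two BAM files, because foo_2_dup.bam has no report and everything after it slides by one position.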

Moreover, to remove the "bai" files, I would use grep:

find /dir/ -iname "*dup*" | grep -v "bai"


Finally, if you know that for each "dup" file there is always a "dup_recal_report" file, I would apply find only to the dup files, and then use {.} to reconstruct the name of foo_1_dup_recal_report:

find /dir/ -iname "*dup*" | grep -v "bai" | grep -v "report" | parallel "echo java -jar GenomeAnalysisTK.jar -I {} -BQSR {.}_recal_report -o {.}_recal.bam"
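The same idea can be sketched as a small pre-check that derives the report name from each BAM name and only emits complete pairs. The function name `list_pairs` and the demo file names are hypothetical; the `_dup.bam` / `_dup_recal_report` naming is taken from the question.

```shell
#!/bin/sh
# Sketch: print "bam<TAB>report" for each *_dup.bam that has a matching
# report, and warn on stderr about BAM files that do not.
set -eu

list_pairs() {
    for bam in "$1"/*_dup.bam; do
        [ -e "$bam" ] || continue            # the glob matched nothing
        report="${bam%.bam}_recal_report"    # same idea as {.}_recal_report
        if [ -f "$report" ]; then
            printf '%s\t%s\n' "$bam" "$report"
        else
            printf 'skipping %s: missing %s\n' "$bam" "$report" >&2
        fi
    done
}

# Demo with made-up files: foo_2 has no report and should be skipped.
demo_dir=$(mktemp -d)
touch "$demo_dir/foo_1_dup.bam" "$demo_dir/foo_1_dup_recal_report" \
      "$demo_dir/foo_2_dup.bam" \
      "$demo_dir/foo_3_dup.bam" "$demo_dir/foo_3_dup_recal_report"

paired=$(list_pairs "$demo_dir")
printf '%s\n' "$paired"
rm -rf "$demo_dir"
```

If you prefer to keep {1} and {2}, the tab-separated output could be fed to something like `parallel --colsep '\t' "java -jar GenomeAnalysisTK.jar -I {1} -BQSR {2} -o {1.}_recal.bam"`, so that an orphaned file can no longer shift the pairing.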

1

I think you are right, there is a "dup" file that does not have a corresponding "dup_recal_report" file. I am not interested in it but every time I tried to remove it, it said "text file busy"! Somehow I had forgotten about it. I used grep -v after the find command and used my original approach so it worked! Thank you very much for pointing this out otherwise I might have spent too much time wondering why my approach was not working.

0
9.2 years ago

Model script for solving your problem

step_1: Make a list of type {1} files. ls -1 "Whatever pattern" > list_1 (list all files only of {1})

Script:

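# For each {1} file in list_1, derive the matching {2} file name and run the program.
while IFS= read -r File_1; do
    # Example substitution only; adapt the sed expression to your own naming scheme.
    File_2=$(printf '%s\n' "$File_1" | sed 's/\.bam$/_recal_report/')
    whatever_program "$File_1" "$File_2" respective command line arguments
done < list_1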


Your script rocks. This might not be the best approach, but it is what I often use. If you don't follow the script, please comment and I will try to answer your query.

1

this is not parallel

0

He can add parallel after finding the files, in the last line of the script.

0

I would really prefer to use parallel because I find it amazing at load-balancing!!

0

You asked for a better way to do it, and I've given my answer. You can use parallel or any other program: replace the last line of the model script with parallel and its arguments, including the files.

1

I think you are missing the point. Yes you could certainly add a & to background each iteration but that effectively cripples the usage of gnu parallel. http://www.gnu.org/software/parallel/
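One way the two approaches might be combined, sketched under assumptions: let the loop only print one command line per file, and hand the scheduling to GNU parallel, which treats each input line as a command when no command is given on its own command line. The function name `make_commands` and the list contents are hypothetical, and the file names are assumed to contain no spaces or shell metacharacters.

```shell
#!/bin/sh
# Sketch: the loop only builds command lines; it does not run them.
set -eu

make_commands() {
    # $1 is a file listing one BAM path per line (like list_1 above).
    while IFS= read -r bam; do
        base="${bam%.bam}"
        printf 'java -jar GenomeAnalysisTK.jar -T PrintReads -R hg19.fasta -I %s -BQSR %s -o %s\n' \
            "$bam" "${base}_recal_report" "${base}_recal.bam"
    done < "$1"
}

# Demo list with made-up names.
list=$(mktemp)
printf '%s\n' foo_1_dup.bam foo_2_dup.bam > "$list"

commands=$(make_commands "$list")
printf '%s\n' "$commands"
rm -f "$list"

# To actually run the jobs, the printed lines could be piped to parallel:
#   make_commands list_1 | parallel -j2 --joblog recalibration_joblog
```

This keeps the name handling in an ordinary, easy-to-inspect loop while parallel still does the load balancing, rather than backgrounding each iteration with &.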