Question: Is it possible to run haplotypecaller in gnu parallel ?
2
gravatar for dare_devil
5 weeks ago by
dare_devil1.4k
India
dare_devil1.4k wrote:

I have a bash script about HaplotypeCaller.sh. Is it possible to run this in gnu parallel ?

cat HaplotypeCaller.sh | parallel

#!/bin/bash

for i in *.bam; do 
gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I $i -O $i.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list; done
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by dare_devil1.4k
1

Something like this should work

find . -name '*.bam' | rev | cut -c4- | parallel -I{} gatk --java-options "-Xmx40g" HaplotypeCaller \
    -R /media/gatk/Homo_sapiens_assembly38.fasta \
    -I {}.bam \
    -O {}.vcf.gz \
    --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz \
    -L /media/gatk/target/ag_V6_list.interval_list
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by 4galaxy77290

I get following error while running above command

Error was: .detros_1C100/..bam with exception: Cannot read non-existent file:
ADD REPLYlink written 5 weeks ago by dare_devil1.4k

@ 4galaxy77 rev | cut -c4- removes the .bam extension. But {.} in parallel removes the extension.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by cpad011215k
1

see if following prints exact commands you want to run and then remove --dry-run option

parallel --dry-run 'gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list' ::: *.bam

make sure that system has enough resources to execute parallel commands. Other wise, limit memory / threads/ cpus.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by cpad011215k

use a workflow manager.

ADD REPLYlink written 5 weeks ago by Pierre Lindenbaum134k
4
gravatar for ole.tange
5 weeks ago by
ole.tange4.0k
Denmark
ole.tange4.0k wrote:
#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$1".vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file. Then:
parallel do_one ::: *.bam

If you do not want the output to be named foo.bam.vcf.gz but instead foo.vcf.gz:

#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$2" --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file:
#   do_one foo.bam foo.vcf.gz
# Then:
parallel do_one {} {.} ::: *.bam
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by ole.tange4.0k
2

As written above, the output files will have .bam.vcf.gz extension, where retaining the .bam was probably not desirable, and is why the OP's pipeline includes | cut -c4-.

I would not use a bash function in this case, and would rather write:

parallel gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list ::: *.bam

Taking advantage of parallel's {.} variable which is the file without its last extension (in this case, trimming the trailing .bam).

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Malcolm.Cook1.2k

Cheers !! this works!!

ADD REPLYlink written 5 weeks ago by dare_devil1.4k

This gives me an error export: Illegal option -f

ADD REPLYlink written 5 weeks ago by dare_devil1.4k

What is your shell? Something other than bash? In any case, see my other comment. You don't need a function at all.

ADD REPLYlink written 5 weeks ago by Malcolm.Cook1.2k

Are you writing exactly what I wrote? Because if you change #!/bin/bash to #!/bin/sh then it will not work.

ADD REPLYlink written 5 weeks ago by ole.tange4.0k

Ya I have taken the same code provided by you. It gives an error export: Illegal option -f

ADD REPLYlink written 5 weeks ago by dare_devil1.4k

Update: When I executed your code one by one it started running. But When I saved your code in a file as parallel_haplo.sh it does not work.

sh parallel_haplo.sh It gives the error: parallel_haplo.sh : 6: export: Illegal option -f

It works when I run as bash parallel_haplo.sh. I don't know what made the difference

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by dare_devil1.4k
2
gravatar for dare_devil
5 weeks ago by
dare_devil1.4k
India
dare_devil1.4k wrote:

The answer given by Malcolm.Cook, ole.tange and cpad0112 works well

Method 1:

parallel gatk --java-options "-Xmx40g" HaplotypeCaller \
-R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz \
 --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz \
-L /media/gatk/target/ag_V6_list.interval_list ::: *.bam

Method 2:

#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$1".vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file. Then:
parallel do_one ::: *.bam
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by dare_devil1.4k
2
gravatar for Pierre Lindenbaum
5 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

nextflow (not tested)

params.ref="/media/gatk/Homo_sapiens_assembly38.fasta"
params.bams=""
params.dbsnp="/media/gatk/dbsnp_138.hg38.vcf.gz"
params.intervals="/media/gatk/target/ag_V6_list.interval_list"

Channel.fromPath(params.bams).splitCsv(header: false,sep:',',strip:true).map{T->file(T[0])}.set{bamfiles}

process hapCaller {
tag "${bam.name}"
memory "40g"
input:
   file bam from bamfiles
output:
   file("${bam.getBaseName()}.vcf.gz") into vcf
script:
"""
gatk --java-options " -Xmx${task.memory.giga}g -Djava.io.tmpdir=." HaplotypeCaller \
    -R ${params.ref} \
    -I {bam.toRealPath()} \
    -O "${bam.getBaseName()}.vcf.gz" \
    --dbsnp "${params.dbsnp}" \
    -L "${params.intervals}"
"""
}
ADD COMMENTlink written 5 weeks ago by Pierre Lindenbaum134k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1184 users visited in the last hour
_