Question

How to effectively use GATK haplotype caller with -L breaking up BAM files

1

Entering edit mode

4.4 years ago

mgdesaix ▴ 20

Hello,

I've been browsing forums and trying to figure out how to run haplotypecaller from GATK more efficiently. I want to minimize the size of each job I'm submitting because I've had issues with jobs timing out on my cluster (using SLURM) and also taking forever to run when they do. I had been running:

"$gatk" HaplotypeCaller -R "$reference" -I "$bam" -O "$out"."$gvcf".g.vcf -ERC GVCF

with the appropriate variables filled in. I want to break up the bamfiles using -L option and split into manageable chunks based on scaffolds. The header of a bamfile looks like this:

[mgdesaix@xxxx]$ samtools view Plate1.18N00490_RG.bam -H

@HD VN:1.6 SO:coordinate

@SQ SN:scaffold1|size5275185 LN:5275358

@SQ SN:scaffold2|size3399639 LN:3399639

@SQ SN:scaffold3|size3342599 LN:3342599

@SQ SN:scaffold4|size3742848 LN:3742848

etc., ending with

@SQ SN:scaffold45765|size8352 LN:8352

@SQ SN:scaffold47060|size9013 LN:9013

@SQ SN:scaffold47992|size8514 LN:8514

then ID, LB, etc.

To use -L, would I specify "$gatk" HaplotypeCaller -R "$reference" -I "$bam" -O "$out"."$gvcf".g.vcf -ERC GVCF -L scaffold1:10000 to only produce a gvcf file for the first 10,000 scaffolds? And if so and then I produce multiple gvcf files for an individual broken up by the scaffolds, does GATK seamlessly merge the gvcf files by ID once I combine all of them or is there an extra step I need to do to make sure individuals are recombined (individuals in this case are specified by unique ID)?

If anyone has input on this or where to find an example that would be great! So far I hadn't been able to figure it out based on my interpretation of the GATK documentation and browsing elsewhere.

Thanks,

Matt

GATK variant calling • 2.0k views

ADD COMMENT • link updated 4.4 years ago by rbagnall ★ 1.8k • written 4.4 years ago by mgdesaix ▴ 20

score 1 · Answer 1 · 2019-11-21

1

Entering edit mode

4.4 years ago

rbagnall ★ 1.8k

You want to pass the first 10,000 scaffolds in a text file called an interval list

-L path/to/my_interval.list

Where an interval list file to include the first four scaffolds would look like:

scaffold1

scaffold2

scaffold3

scaffold4

This will give a single g.vcf for the first 10,000 scaffolds.

For your example, you will end up with a g.vcf for scaffolds 1 - 10,000, another for 10,001 - 20,000, another for 20,001 - 30,000 etc. You can either merge them using gatherVCFs, or leave them as they are and genotype them in parallel.

ADD COMMENT • link 4.4 years ago by rbagnall ★ 1.8k

0

Entering edit mode

Ah, that makes so much sense...I knew it must be something simple that I was missing. I will give it a go.

Thanks!

ADD REPLY • link 4.4 years ago by mgdesaix ▴ 20