Question

SNP calling with many samples using bcftools

20 months ago

George ▴ 20

Hello, I aim to identify SNPs from approximately 500 BAM files (non-human). I'm opting for bcftools since GATK, even with the Spark addition, takes a substantial 6 hours per sample. My objective is to generate a single VCF file encompassing all SNPs detected across the 500 samples. I'm considering two approaches:

1. Run mpileup on all BAM files at once and then call SNPs. However, this run is not parallelized, so it could take a long time.
2. Run mpileup on each file separately, with a single thread per run (so about 30 files simultaneously), and then merge the individual VCF files into one consolidated file. This may be faster overall, since the merge would likely take less time than the mpileup step.

Your insights on the most efficient strategy would be greatly appreciated.

bcftools SNP multithreading • 1.4k views

ADD COMMENT • link 20 months ago by George ▴ 20
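
For reference, the two approaches in the question can be sketched roughly as below. This is a minimal sketch, not a tested pipeline: `ref.fa`, `bams.txt` (one BAM path per line), the output names, and the helper functions are hypothetical, and the `-mv` calling options are an assumption rather than the poster's actual settings.

```shell
# Hypothetical inputs: override via the environment as needed.
REF=${REF:-ref.fa}               # reference FASTA (assumed name)
BAM_LIST=${BAM_LIST:-bams.txt}   # one BAM path per line (assumed name)
JOBS=${JOBS:-30}                 # number of parallel single-threaded runs

# Map sample.bam -> sample.vcf.gz
vcf_name() { printf '%s\n' "${1%.bam}.vcf.gz"; }

# Approach 1: a single mpileup/call run over all BAMs at once.
joint_call() {
    bcftools mpileup -f "$REF" -b "$BAM_LIST" -Ou \
        | bcftools call -mv -Oz -o joint.vcf.gz
}

# Approach 2: one mpileup/call run per BAM, JOBS at a time, then merge.
per_sample_call() {
    export REF
    xargs -P "$JOBS" -I{} sh -c '
        out="${1%.bam}.vcf.gz"
        bcftools mpileup -f "$REF" -Ou "$1" \
            | bcftools call -mv -Oz -o "$out" \
            && bcftools index "$out"
    ' _ {} < "$BAM_LIST"

    # Build the list of per-sample VCFs and merge them into one file.
    while IFS= read -r bam; do vcf_name "$bam"; done < "$BAM_LIST" > vcfs.txt
    bcftools merge -l vcfs.txt -Oz -o merged.vcf.gz
}
```

One point to keep in mind when comparing the two: per-sample VCFs written with `call -v` contain only variant sites, so after `bcftools merge` a sample that matches the reference at a site shows up as missing (`./.`) rather than `0/0`, whereas the joint run genotypes every sample at every reported site.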

"since GATK, even with the Spark addition, takes a substantial 6 hours per sample."

How do you call with GATK? Do you use GVCF? How does it compare to bcftools? bcftools would take a huge amount of time.

ADD REPLY • link 20 months ago by Pierre Lindenbaum 166k

Hello, and thank you for your response! Here is the command I used to run GATK HaplotypeCaller with Spark:

gatk HaplotypeCallerSpark -I myfile.bam -R my.ref.fasta -O out.vcf

I did not use the -ERC GVCF option. In all my tests, GATK was notably slower than bcftools.

ADD REPLY • link 20 months ago by George ▴ 20

I think GATK would be faster in GVCF mode for 500 BAMs.

ADD REPLY • link 20 months ago by Pierre Lindenbaum 166k
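
For context, the GVCF route suggested above usually means calling each sample separately with `-ERC GVCF` and then genotyping all samples jointly. The following is a rough sketch under assumptions: `bams.txt`, the output names, and the helper functions are hypothetical, and whether this is actually faster for this dataset is not established in the thread.

```shell
REF=${REF:-my.ref.fasta}         # reference name as used in the thread
BAM_LIST=${BAM_LIST:-bams.txt}   # hypothetical: one BAM path per line

# Map sample.bam -> sample.g.vcf.gz
gvcf_name() { printf '%s\n' "${1%.bam}.g.vcf.gz"; }

# Step 1: per-sample calling in GVCF mode (independent per sample,
# so these runs can be distributed across cores or nodes).
call_gvcf() {
    gatk HaplotypeCaller -R "$REF" -I "$1" -O "$(gvcf_name "$1")" -ERC GVCF
}

# Steps 2 and 3: combine the per-sample GVCFs, then genotype jointly.
joint_genotype() {
    set --    # reuse "$@" to collect one --variant argument per sample
    while IFS= read -r bam; do
        set -- "$@" --variant "$(gvcf_name "$bam")"
    done < "$BAM_LIST"
    gatk CombineGVCFs -R "$REF" "$@" -O cohort.g.vcf.gz
    gatk GenotypeGVCFs -R "$REF" -V cohort.g.vcf.gz -O cohort.vcf.gz
}
```

Because each GVCF records reference confidence as well as variants, the joint genotyping step can report a genotype for every sample at every variant site, which a plain merge of variant-only VCFs cannot do.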

Are you suggesting that the speed improvement applies specifically when compared to the bcftools gvcf output, or is it faster overall? Personally, I don't require the GVCF format for my analysis; I only need the SNPs. However, if the process is faster, I see no downside in using it.

ADD REPLY • link 20 months ago by George ▴ 20

"How does it compare to bcftools?"

Sorry, I forgot to answer this. bcftools with mpileup + call + filtering takes about an hour per file on one thread. Since I have 30 threads, I can process 30 files per hour instead of one every 6 hours with Spark.

ADD REPLY • link 20 months ago by George ▴ 20
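
The throughput comparison above works out as follows. This is a back-of-the-envelope sketch using only the figures quoted in the thread (500 files, 30 threads, about 1 hour per file for bcftools, about 6 hours per sample for HaplotypeCallerSpark). It assumes, as the post does, that the Spark run handles one sample at a time, and it ignores the final merge step.

```shell
samples=500          # BAM files to process
threads=30           # single-threaded bcftools runs in parallel
bcftools_hours=1     # approx. hours per file (mpileup + call + filtering)
gatk_hours=6         # approx. hours per sample with HaplotypeCallerSpark

# Number of 30-file batches, rounded up.
batches=$(( (samples + threads - 1) / threads ))

bcftools_total=$(( batches * bcftools_hours ))   # wall-clock hours, merge excluded
gatk_total=$(( samples * gatk_hours ))           # wall-clock hours, one sample at a time

echo "bcftools: ${batches} batches, about ${bcftools_total} hours"
echo "GATK Spark: about ${gatk_total} hours"
```

So under these assumptions the per-file bcftools route needs about 17 hours of wall-clock time before merging, against about 3000 hours for running the Spark command on one sample after another.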

Sorry, it's still not clear: do you want to process the 500 BAMs in one invocation of bcftools (sloooww), or do you want to process one BAM per bcftools invocation?

ADD REPLY • link 20 months ago by Pierre Lindenbaum 166k

This is actually my question, and sorry for being unclear. Which is better: processing one BAM at a time and then merging into one VCF, or running mpileup on all 500 BAMs from the beginning?

ADD REPLY • link 20 months ago by George ▴ 20