Hello, I want to call SNPs from approximately 500 BAM files (non-human). I'm using bcftools because GATK, even with the Spark tools, takes 6 hours per sample. My goal is a single VCF file containing all SNPs detected across the 500 samples. I'm considering two approaches:
1. Run mpileup on all BAM files at once and then call SNPs. However, this run is not parallelized, so it could take a long time.
2. Run mpileup on each file separately, with a single thread per run and about 30 files processed simultaneously. After calling, I would merge the per-sample VCF files into one consolidated file. This could be faster overall, provided the merge takes less time than the mpileup step.
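For concreteness, here is a minimal sketch of how the two approaches could look as command lines. It only builds the shell commands as strings; `run_all` is a hypothetical helper that would execute them with a bounded number of jobs. All file names (`bams.txt`, `ref.fa`, `calls/`, `vcfs.txt`) and function names are assumptions made up for illustration, and the flags shown (`mpileup -f/-b/-Ou`, `call -mv -Oz -o`, `index`, `merge -l`) are just one plausible choice, not a tuned pipeline.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def joint_call_cmd(bam_list, ref, out_vcf):
    """Approach 1: one mpileup over every BAM named in bam_list, piped to call."""
    pileup = ["bcftools", "mpileup", "-f", ref, "-b", bam_list, "-Ou"]
    call = ["bcftools", "call", "-mv", "-Oz", "-o", out_vcf]
    return shlex.join(pileup) + " | " + shlex.join(call)


def per_sample_cmd(bam, ref, out_dir):
    """Approach 2, step 1: call a single BAM, then index the resulting VCF."""
    out_vcf = str(Path(out_dir) / (Path(bam).stem + ".vcf.gz"))
    pileup = ["bcftools", "mpileup", "-f", ref, "-Ou", bam]
    call = ["bcftools", "call", "-mv", "-Oz", "-o", out_vcf]
    index = ["bcftools", "index", out_vcf]
    # The shell groups this as (pileup | call) && index.
    return shlex.join(pileup) + " | " + shlex.join(call) + " && " + shlex.join(index)


def merge_cmd(vcf_list, out_vcf):
    """Approach 2, step 2: merge the indexed per-sample VCFs named in vcf_list."""
    return shlex.join(["bcftools", "merge", "-l", vcf_list, "-Oz", "-o", out_vcf])


def run_all(cmds, jobs=30):
    """Hypothetical helper: run shell commands, at most `jobs` at a time.

    Raises subprocess.CalledProcessError if any command exits non-zero.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(lambda c: subprocess.run(c, shell=True, check=True), cmds))
```

With these helpers, the second approach would amount to something like `run_all([per_sample_cmd(b, "ref.fa", "calls") for b in bams])` followed by running `merge_cmd("vcfs.txt", "merged.vcf.gz")`. One thing worth checking with this design: because `call -v` writes only variant sites, a sample that has no record at a site ends up with a missing genotype after `bcftools merge`, so the merged file may not be equivalent to the output of the joint run.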
Your insights on the most efficient strategy would be greatly appreciated.