I've been using samtools and bcftools to call variants from 16 human whole-genome runs. They were obtained from NCBI in .sra format and extracted to .bam format. Each of the 16 BAMs was sorted and indexed, and I used samtools mpileup to generate the .bcf files.
To accomplish this in a reasonable amount of time I split the data into 1-megabase windows (e.g. chr1:1-1000000, chr1:1000001-2000000), which translates to 3,114 windows. I ran them on a cluster accessible to my lab, running 800 jobs at a time. It finished in a reasonable amount of time, and I proceeded to generate index files for each .bcf using (this is a simplified version):
#SBATCH --array=1-3114%800
bcftools index <fileWindow.bcf>
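The windowing and array-job setup described above can be sketched roughly as follows. This is a minimal bash sketch under stated assumptions, not my actual script: the function names and the `chr1_1_1000000.bcf`-style file naming are hypothetical, and only `SLURM_ARRAY_TASK_ID` and `bcftools index` are real names.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the <chrom>_<start>_<end>.bcf naming are hypothetical.

# Print 1-based, inclusive windows for one chromosome as "chrom:start-end".
make_windows() {
    local chrom=$1 length=$2 size=${3:-1000000}
    local start end
    for (( start = 1; start <= length; start += size )); do
        end=$(( start + size - 1 ))
        # Clip the last window to the chromosome length.
        if (( end > length )); then
            end=$length
        fi
        printf '%s:%d-%d\n' "$chrom" "$start" "$end"
    done
}

# Pick the Nth window (1-based) from a windows file, as one array task would.
window_for_task() {
    local windows_file=$1 task_id=$2
    sed -n "${task_id}p" "$windows_file"
}

# Turn "chr1:1-1000000" into a file name such as "chr1_1_1000000.bcf".
bcf_name_for_window() {
    local name=$1
    name=${name//:/_}
    name=${name//-/_}
    printf '%s.bcf\n' "$name"
}

# Inside the array job, each task would then do something like:
#   window=$(window_for_task windows.txt "$SLURM_ARRAY_TASK_ID")
#   bcftools index "$(bcf_name_for_window "$window")"
```

With this layout every array task touches exactly one window file, so up to 800 index jobs read and write on the shared filesystem at the same time.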
I assumed it would take the same or fewer resources than generating the .bcf files; it didn't. For some reason it substantially slowed the cluster (it's been 0 days since I last brought the cluster to its knees). After talking with the cluster support team, we didn't reach any solid conclusion about what was going on. I'm guessing that if I understood better what the indexing algorithm does, I might figure it out.
Could anyone roughly explain the marked difference in performance between mpileup and bcftools index? I'm aware the algorithms are different, but I would have assumed that making an index is faster than generating the .bcf.
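To make the comparison concrete, one option would be to time a single window for each step and compare wall-clock time against CPU time. A minimal bash sketch, assuming bash's built-in `time` keyword; `time_cmd` is a hypothetical helper and the file name is a placeholder:

```shell
#!/usr/bin/env bash
# Sketch only: time_cmd is a hypothetical helper, not part of samtools or bcftools.

# Run a command and report wall-clock, user CPU and system CPU seconds on stderr.
# Wall-clock time far above user + sys suggests the job is mostly waiting
# (for example on disk or network I/O) rather than computing.
time_cmd() {
    local TIMEFORMAT='real=%R user=%U sys=%S'
    time "$@"
}

# Example on a single window (placeholder file name):
#   time_cmd bcftools index chr1_1_1000000.bcf
```

Running the same helper around the mpileup command for the same window would give numbers that can be compared directly.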
Thanks in advance if someone decides to tackle this with me. I'll be happy to provide as much info as possible.
How does the bcftools index algorithm generate the indexes? Which resources would it use most? Why would it be more resource-demanding than creating the .bcf files?