I am trying out Pindel (https://github.com/genome/pindel) with our exome data, and even on demo datasets it takes far longer to run than is feasible in a production setting. I am using Pindel version 0.2.5b9 (20160729), running in a Singularity container on our HPC cluster. My config file looks like this:
$ cat pindel_config.txt
SeraCare.dd.bam 500 Tumor
HapMap.dd.bam 500 Normal
My command looks like this:
pindel --fasta genome.hg19.fa \
    --config-file pindel_config.txt \
    --output-prefix output/ \
    --number_of_threads 40 \
    --include targets.bed
I am running with 40 CPU threads in a SLURM job with 320 GB of RAM allocated. My targets.bed has 10,600 regions. Judging from the logs, Pindel is correctly searching only the supplied target regions. However, after 6 hours it has only gotten through 4,500 of the 10,600 regions, so at this rate the full run will take roughly 14 hours. This is only a demo dataset; our real exome samples will likely take much longer, and I have to run many of these per sequencing run. That will far exceed our available time and compute resources.
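For reference, the runtime projection above comes straight from the observed throughput; a quick back-of-envelope check:

```shell
# Project total runtime from the observed rate (4,500 regions in 6 hours).
awk 'BEGIN {
    rate = 4500 / 6                                   # ~750 regions/hour
    printf "total: %.1f h\n", 10600 / rate            # whole run at this rate
    printf "remaining: %.1f h\n", (10600 - 4500) / rate
}'
```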
It seems I need to chunk the target regions further so that I can run more Pindel jobs in parallel and finish in a reasonable time. However, I am not sure of the best way to do this, and I do not want to compromise the integrity of the results. Is it safe to chunk per chromosome? Or could I chunk even further, supplying only ~100 target regions per job, without affecting the results?
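To illustrate, the per-chromosome chunking I have in mind would look something like the sketch below. The chunks/ and output/ layout, the toy targets.bed, and the per-job resource sizes are all illustrative; the pindel flags are the same ones from my command above, and in practice each echoed command line would be wrapped in an sbatch submission:

```shell
# Toy 3-region targets.bed standing in for the real 10,600-region file:
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t100\t200\n' > targets.bed

# Split into one BED per chromosome, keyed on column 1:
mkdir -p chunks output
awk '{ print > ("chunks/" $1 ".bed") }' targets.bed

# One Pindel invocation per chunk, each writing to its own output prefix.
# Replace 'echo' with an sbatch --wrap (or equivalent) to actually submit:
for bed in chunks/*.bed; do
    name=$(basename "$bed" .bed)
    mkdir -p "output/$name"
    echo pindel --fasta genome.hg19.fa \
        --config-file pindel_config.txt \
        --output-prefix "output/$name/" \
        --number_of_threads 8 \
        --include "$bed"
done
```

With ~24 chromosome-level jobs instead of one, each job only has to cover its own slice of the target regions, at the cost of per-job scheduling overhead.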