Question

Safe ways to chunk data to speed up Pindel?

4.8 years ago

steve ★ 3.5k

I am trying out Pindel (https://github.com/genome/pindel) with our exome data, and even on demo datasets it takes extremely long to run, far longer than is feasible in a production setting. I am using Pindel version 0.2.5b9 (20160729), running in a Singularity container on our HPC cluster. My config file looks like this:

$ cat pindel_config.txt
SeraCare.dd.bam 500 Tumor
HapMap.dd.bam   500 Normal

My command looks like this:

pindel --fasta genome.hg19.fa \
--config-file pindel_config.txt \
--output-prefix output/ \
--number_of_threads 40 \
--include targets.bed

I am running with 40 CPU threads in a SLURM job with 320 GB of RAM allocated. My targets.bed has 10,600 regions. Judging from the logs, Pindel is correctly searching only the supplied target regions. However, after 6 hours it has only gotten through 4,500 of the 10,600 regions. At this rate (about 750 regions per hour), the full run will take roughly 14 hours, i.e. about 8 more hours. And this is only a demo dataset; it will likely take much longer on our real exome samples, and I have to run many of these per sequencing run. That will far exceed our available time and compute resources.

It seems I need to split the target regions into smaller chunks so that I can run more Pindel jobs in parallel and finish in a reasonable time. However, I am not sure what the best method for this is, and I do not want to compromise the integrity of the results. Is it safe to chunk per chromosome? Or could I chunk even further, supplying only ~100 target regions per job, without affecting the results?

pindel indel • 905 views


I have mixed feelings about Pindel. A trusted colleague working in cancer genomics in New York recently told me that it has a high false-positive rate. I still use it, but had already assumed that many of the calls were false positives. I guess one can mitigate these by requiring a higher read depth when filtering.

I don't believe there are any drawbacks to 'chunking' the analysis per chromosome, although I have not yet benchmarked it with and without chunking. One possible exception: inter-chromosomal events may not be called when one chunks by chromosome, but I'm just not sure.
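
For illustration only, here is a minimal Python sketch of how targets.bed could be split into per-chromosome chunks, optionally capped at ~100 regions each as asked in the question. The function names, the output file naming, and the max_regions default are all hypothetical, and it assumes the chromosome name is the first whitespace-separated field of each BED line. It only splits the file; it does not show that chunked Pindel runs give the same calls as an unchunked run.

```python
from collections import defaultdict
from pathlib import Path


def chunk_bed_lines(lines, max_regions=100):
    """Group BED lines by chromosome, then split each group into chunks
    of at most max_regions regions. A chunk never mixes chromosomes.

    Returns a list of (chromosome, [bed_line, ...]) tuples.
    """
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]  # assumption: first field is the chromosome
        by_chrom[chrom].append(line)

    chunks = []
    for chrom, regions in by_chrom.items():
        for start in range(0, len(regions), max_regions):
            chunks.append((chrom, regions[start:start + max_regions]))
    return chunks


def write_chunks(bed_path, out_dir, max_regions=100):
    """Write each chunk to its own BED file and return the file paths.

    The chunk_XXXX_<chrom>.bed naming scheme is hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(bed_path) as handle:
        chunks = chunk_bed_lines(handle, max_regions)

    paths = []
    for index, (chrom, regions) in enumerate(chunks):
        path = out_dir / f"chunk_{index:04d}_{chrom}.bed"
        path.write_text("\n".join(regions) + "\n")
        paths.append(path)
    return paths
```

Each resulting file could then be given to its own Pindel job through the same --include option used in the question, for example as tasks of a SLURM job array. Comparing the merged output against one unchunked run on a small dataset would be a sensible check before relying on it.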

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k