Safe ways to chunk data to speed up Pindel?
4.8 years ago
steve ★ 3.5k

I am trying out Pindel (https://github.com/genome/pindel) with our exome data, and even on demo datasets it is taking far too long to run to be feasible in a production setting. I am using Pindel version 0.2.5b9 (20160729), running in a Singularity container on our HPC cluster. My config file looks like this:

$ cat pindel_config.txt
SeraCare.dd.bam 500 Tumor
HapMap.dd.bam   500 Normal

My command looks like this:

pindel --fasta genome.hg19.fa \
--config-file pindel_config.txt \
--output-prefix output/ \
--number_of_threads 40 \
--include targets.bed

I am running with 40 CPU threads, in a SLURM job with 320GB RAM allocated. My targets.bed has 10,600 regions. Judging from the logs, Pindel is correctly searching in only the supplied target regions. However, after 6 hours, it has only gotten through 4,500 of the 10,600 regions provided. At this rate, it will take ~14 hours in total to finish (about 8 more hours). This is only a demo dataset; it will likely take much longer on our real exome samples, and I have to run many of these per sequencing run. That will far exceed our available time and compute resources.

It seems I need to further chunk the areas of the genome that Pindel analyzes, so that I can run more jobs in parallel and get a more reasonable completion time. However, I am not sure what the best method for this is, and I do not want to compromise the integrity of the results. Is it safe to chunk per-chromosome? Or could I chunk even further, supplying only ~100 target regions per job without affecting the results?
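
To make the second option concrete, here is a minimal sketch of fixed-size chunking. It splits a BED file into pieces of at most N regions and prints (does not run) one Pindel command per piece, reusing only the flags from my command above. The helper name split_bed, the tiny demo BED file, the per-chunk thread count and the output prefixes are all hypothetical, and whether this kind of chunking leaves the calls unchanged is exactly what I am unsure about.

```shell
# Split a BED file into chunks of at most N regions each.
# Sketch only: split_bed is a hypothetical helper, not part of Pindel.
split_bed() {
    bed="$1"; n="$2"; outdir="$3"
    mkdir -p "$outdir"
    # split(1) writes chunk_aa, chunk_ab, ... with up to $n lines each
    split -l "$n" "$bed" "$outdir/chunk_"
}

# Demo with a tiny made-up BED file (5 regions, 2 regions per chunk)
demo_dir=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t100\t200\nchr2\t500\t600\nchr3\t10\t20\n' \
    > "$demo_dir/targets.bed"
split_bed "$demo_dir/targets.bed" 2 "$demo_dir/chunks"

# Print one Pindel command per chunk; thread count and prefix are assumptions
for chunk in "$demo_dir"/chunks/chunk_*; do
    echo pindel --fasta genome.hg19.fa \
        --config-file pindel_config.txt \
        --output-prefix "output/$(basename "$chunk")_" \
        --number_of_threads 4 \
        --include "$chunk"
done
```

With the real targets.bed and N=100 this would give 106 chunks, each of which could be submitted as its own SLURM job.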

pindel indel • 905 views

I have mixed feelings about pindel. A trusted colleague working in cancer genomics in New York recently told me that it has a high false positive rate. I still use it but had already assumed many of the calls were false positives. I guess that one can mitigate these calls by raising the minimum read-depth threshold when filtering.

I don't believe there are any drawbacks to 'chunking' the analysis per chromosome. I have not yet benchmarked it with / without chunking. Inter-chromosomal events may not be called when one chunks by chromosome, though - I'm just not sure.
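
For what it's worth, per-chromosome chunking could be sketched roughly as below: split the targets BED file into one file per chromosome, then run one Pindel job per file via the same --include flag used in the question. This is an untested illustration, not something I have benchmarked. The helper name split_bed_by_chrom, the demo regions, the thread count and the output prefixes are hypothetical, and the commands are only printed, not executed.

```shell
# Write one BED file per chromosome (named after column 1 of the input).
# Sketch only: split_bed_by_chrom is a hypothetical helper.
split_bed_by_chrom() {
    bed="$1"; outdir="$2"
    mkdir -p "$outdir"
    # Each line is written to <outdir>/<chrom>.bed
    awk -v d="$outdir" '{ print > (d "/" $1 ".bed") }' "$bed"
}

# Demo with a tiny made-up BED file covering three chromosomes
chrom_demo=$(mktemp -d)
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t100\t200\nchr3\t10\t20\n' \
    > "$chrom_demo/targets.bed"
split_bed_by_chrom "$chrom_demo/targets.bed" "$chrom_demo/by_chrom"

# Print one Pindel command per chromosome file; thread count is an assumption
for chrom_bed in "$chrom_demo"/by_chrom/*.bed; do
    chrom=$(basename "$chrom_bed" .bed)
    echo pindel --fasta genome.hg19.fa \
        --config-file pindel_config.txt \
        --output-prefix "output/${chrom}_" \
        --number_of_threads 8 \
        --include "$chrom_bed"
done
```

Comparing the merged per-chromosome output against a single unchunked run on a small dataset would be one way to check whether any calls, inter-chromosomal ones in particular, are lost.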
