Question: Indel Discovery Delly, Pindel, Samtools, Gatk
1
5.4 years ago by
rob234king560
UK/Harpenden/Rothamsted Research
rob234king560 wrote:

I'm testing samtools vs GATK for SNP and indel calling, and I'm looking at using Pindel for SVs (focusing on insertions in particular) and Delly for the other SV types. What experience do people have with SNP calling and SV tools, including filtering options?

Some possible discussion points: 1. Has anyone used Delly and found the results to be comparable with other SV detectors? 2. What filtering parameters do people use for SNP calling and indels?

I'm reading through papers on indel calling. Samtools looks good for SNPs and GATK is better than samtools at small indels, but for larger indels and SVs it doesn't seem as clear-cut, I presume because the detection mechanism is more complicated.

Does anybody have recommendations for software they would use in a SNP/indel/SV calling pipeline? I've seen an older Biostars post, but there was no mention of larger indels and SVs.

indel gatk samtools snp pindel • 5.9k views
modified 5.4 years ago • written 5.4 years ago by rob234king560
1

you might want to break this up into 4 questions - this is too much for one post

modified 5.4 years ago • written 5.4 years ago by Jeremy Leipzig18k

I'll change it, but 1 to 4 are discussion points on SNP/SV software and options, which fall under the general umbrella question.

written 5.4 years ago by rob234king560

Is your data whole-genome or exome? PM me if it is exome.

written 5.4 years ago by mchaisso160

Whole genome data

written 5.4 years ago by rob234king560
4
5.4 years ago by
Mahdi Sarmady290
USA
Mahdi Sarmady290 wrote:

This is not a question with a single answer but, as you asked for experiences, I can share our story with you.

For our whole exome pipeline, we used to run Novoalign + GATK Indel Realigner + GATK BQSR + Unified Genotyper for both indels and SNPs; it works great for indels up to 30bp. The problem with this pipeline is that GATK Indel Realigner + GATK BQSR take about 40% of the total running time, which for whole genome can be huge. Also, based on the GATK website, version 2.6+ of GATK Haplotype Caller works better and faster for both SNPs and indels, and it can detect larger indels as well. I compared four pipelines using the GCAT tool (Novoalign version 3.01, GATK version 2.6-5, and duplicates marked with Picard after alignment in all four):

  1. Novoalign + GATK Indel Realigner + GATK BQSR + GATK Unified Genotyper
  2. Novoalign + GATK Haplotype Caller
  3. Novoalign + GATK Unified Genotyper
  4. Novoalign + GATK Indel Realigner + GATK BQSR + GATK Haplotype Caller

You can view the report of the comparison here. Based on this comparison, we chose Novoalign V3 followed directly (after marking duplicates, of course) by GATK Haplotype Caller (version 2.6+) for both SNPs and small indels. Since Novoalign does base quality realignment and guarantees optimal alignment, our comparison showed that GATK BQSR and Indel Realigner have no significant impact on the results, so, given how slow they are, we removed them from our pipeline.
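To put the "about 40% of total running time" figure in perspective, here is a minimal Python sketch of the arithmetic. The 100-hour baseline is a made-up number used only for illustration, and the function names are hypothetical; only the 40% fraction comes from the answer above.

```python
def remaining_runtime(total_hours: float, dropped_fraction: float) -> float:
    """Hours left after removing steps that take `dropped_fraction` of the run."""
    if not 0.0 <= dropped_fraction < 1.0:
        raise ValueError("dropped_fraction must be in [0, 1)")
    return total_hours * (1.0 - dropped_fraction)


def speedup(dropped_fraction: float) -> float:
    """Ratio of old runtime to new runtime once those steps are removed."""
    if not 0.0 <= dropped_fraction < 1.0:
        raise ValueError("dropped_fraction must be in [0, 1)")
    return 1.0 / (1.0 - dropped_fraction)


if __name__ == "__main__":
    # Hypothetical 100-hour run; 0.4 is the fraction quoted for
    # Indel Realigner + BQSR in the answer.
    print(remaining_runtime(100.0, 0.4))
    print(speedup(0.4))
```

In other words, if the two steps really account for about 40% of the run, dropping them shortens the pipeline by roughly a factor of 1/0.6, assuming the remaining steps take the same time as before.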

As we moved to whole genome, we surveyed a number of structural variation callers, including Pindel, Breakdancer, Delly, Lumpy and CNVnator. We chose combination of Delly (delly, jumpy, invy, duppy) and CNVnator for large insertions. All comparisons were done by comparing 60x whole-genome data against CNV array data from the same sample, and only the combination of these two tools was able to call the most complex structural variants in the sample we used for comparison.

I should mention that for all these comparisons, we were looking for the best balance of speed, sensitivity and specificity. Some of the structural variation tools run far too slowly, which makes them impractical in a pipeline used to process hundreds of whole genome samples (at least within our infrastructure).
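A comparison of SV calls against array CNVs like the one described here could be sketched as follows. This is not the procedure used in the answer, which does not say how calls were matched; the 50% reciprocal-overlap criterion, the interval representation and all names below are assumptions for illustration.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; an assumed representation.
Interval = Tuple[str, int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two overlap fractions, or 0.0 if the intervals do not overlap."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def sensitivity(calls: List[Interval], truth: List[Interval], min_ro: float = 0.5) -> float:
    """Fraction of truth intervals matched by at least one call at >= min_ro."""
    if not truth:
        return 0.0
    found = sum(
        1 for t in truth
        if any(reciprocal_overlap(c, t) >= min_ro for c in calls)
    )
    return found / len(truth)


if __name__ == "__main__":
    # Made-up intervals, not real data.
    array_cnvs = [("chr1", 1000, 2000), ("chr1", 5000, 6000), ("chr2", 100, 400)]
    sv_calls = [("chr1", 1100, 2050), ("chr2", 90, 150)]
    print(sensitivity(sv_calls, array_cnvs))
```

Swapping the roles of the two lists gives the fraction of calls supported by the array, which is one way to look at specificity alongside sensitivity.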

modified 5.4 years ago • written 5.4 years ago by Mahdi Sarmady290

Interesting, thanks for posting. Novoalign maps less but has better specificity than BWA, though it requires a licence for multithreading. Why did you go with Novoalign? Is it because SNP discovery specificity is more important to you than sensitivity? I'm interested in your opinion on using Novoalign over BWA-MEM; for instance, I can map 99.8% of a tomato genome with BWA and 90% using Novoalign.

written 5.4 years ago by rob234king560

We chose Novoalign because, according to Heng Li's BWA-MEM manuscript (and many other benchmarks), "On accuracy, NovoAlign is the best". We do have the license, and it handles multi-threading and MPI very well. Note, though, that we work only with human data, so everything I wrote in my answer is based on experience with human data.

written 5.4 years ago by Mahdi Sarmady290

This has been very helpful, thanks. Are you allowed to say roughly how much your licence was?

written 5.4 years ago by rob234king560

I don't think it is allowed, but contact them; the price is very reasonable.

modified 5.4 years ago • written 5.4 years ago by Mahdi Sarmady290

I agree with you that IndelRealigner and BQSR take up a large percentage of the pipeline you mentioned, but I think that for people who are processing fewer samples and are determined to get the most accurate results, it may be worth following GATK's best practices and therefore keeping IndelRealigner and BQSR. I like to believe we do not suffer through those extra 25hrs for nothing (also with human data), although you have definitely followed a systematic approach that suggests it may not be necessary.

written 5.1 years ago by Dalia30

"We chose combination of Delly (delly, jumpy, invy, duppy) and CNVnator for large insertions." What about deletions? I guess this is a typo and you meant to say deletions?

written 2.9 years ago by michealsmith740
1
5.4 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

Does the fact that BWA-MEM trims the reads at Q15 by default affect Pindel, because some reads will be shorter than expected and the insert size will be slightly off for some paired reads?

no - 3' trimming should not affect where the reads map and therefore the observed insert size should be the same
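The reasoning can be checked with a toy model in Python. This is only a sketch under the assumption stated in the answer, namely that the untrimmed part of each read still maps to the same place; the `MappedRead` class and the coordinates are hypothetical and do not come from any aligner's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedRead:
    start: int     # leftmost reference coordinate (0-based, inclusive)
    end: int       # rightmost reference coordinate (exclusive)
    reverse: bool  # True if the read maps to the reverse strand


def trim_3prime(read: MappedRead, n: int) -> MappedRead:
    """Remove up to n bases from the 3' end, always keeping at least one base."""
    n = max(0, min(n, read.end - read.start - 1))
    if read.reverse:
        # On the reverse strand the 3' end is the leftmost reference base.
        return MappedRead(read.start + n, read.end, True)
    return MappedRead(read.start, read.end - n, False)


def outer_distance(fwd: MappedRead, rev: MappedRead) -> int:
    """Distance between the 5' ends of a forward/reverse pair (the observed insert size)."""
    return rev.end - fwd.start


if __name__ == "__main__":
    fwd = MappedRead(1000, 1100, reverse=False)
    rev = MappedRead(1300, 1400, reverse=True)
    before = outer_distance(fwd, rev)
    after = outer_distance(trim_3prime(fwd, 20), trim_3prime(rev, 35))
    print(before, after)
```

The 5' ends define the outer distance and trimming only moves the 3' ends, so the value is unchanged in this model. The unsequenced gap between the two reads does grow, which matters only for tools that work from the inner distance.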

written 5.4 years ago by Jeremy Leipzig18k