Inconsistency in SNP detection pipelines for multi-sample analysis
1
1
Entering edit mode
13 days ago
George ▴ 10

Hi everyone,

I'm working on a clustering analysis of multiple samples. I have BAM files that went through QC. My current workflow involves:

  1. Variant calling with bcftools (To replicate GVCF behavior and avoid issues with 0/0 genotypes appearing as missing, I don't use the -v option).
  2. Merging the resulting BCF files with bcftools merge
  3. Filtering insertions/deletions (indels)
  4. Splitting multiallelic sites
  5. Clustering

I'm curious about a potential issue with this method. mpileup has a -b option that allows calling variants from a list of BAM files directly, bypassing the merging step (simulating a merged .bam file). However, the results from these two approaches differ:

  1. bcftools merge method: Finds ~100,000 more SNPs, but lacks 12,000 SNPs identified by the mpileup -b method.
  2. mpileup -b method: Finds fewer SNPs but includes 12,000 unique SNPs

While mpileup -b identifies fewer SNPs, its clustering aligns better with my expectations, suggesting higher accuracy.

However, this method comes at a cost: it's significantly slower and limited to single-threaded execution,making it impossible to run for 100+ samples. In contrast, calling variants individually and then merging them allows for parallelization, offering a substantial speed advantage.

Does anyone have insights into why these methods produce different results?

I'm happy to share my code and any additional information to help diagnose the issue.

Out of concern that indel calling might influence the results, I also attempted variant calling with the -I (mpileup) option. However, the merged VCF file was identical to the one generated by the first method without -I, after the filtering of Indels.

vcf bam bcf snps • 293 views
ADD COMMENT
0
Entering edit mode
13 days ago
LChart 3.9k

To replicate GVCF behavior and avoid issues with 0/0 genotypes appearing as missing, I don't use the -v option

I'm not sure what you mean by "replicate GVCF behavior". Why not write out a gvcf using bcftools call --gvcf? The call quality from merging from GVCFs should be roughly the same as from using mpileup.

However, this method comes at a cost: it's significantly slower and limited to single-threaded execution,making it impossible to run for 100+ samples. In contrast, calling variants individually and then merging them allows for parallelization, offering a substantial speed advantage.

Just a quibble: you can always parallelize over regions by specifying -r.

ADD COMMENT

Login before adding your answer.

Traffic: 2454 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6