Question: SNP Number From Whole-Genome Sequencing
I have whole-genome sequencing data from Illumina. The numbers of SNPs I called using samtools/pileup, samtools/mpileup, and SNVMix are 3,752,858, 3,959,353, and 5,836,045, respectively, for the tumor sample. The corresponding numbers for the blood sample are 3,512,896, 3,686,117, and 5,456,739.

It has been said that samtools/mpileup is better than samtools/pileup (for single-sample SNP calling, they differ little), and that SNVMix is especially suitable for cancer sequencing. So I expected the number of SNPs from SNVMix to be lower than from samtools/mpileup, and the number from samtools/mpileup to be lower than from samtools/pileup.

Why, then, is the number of SNPs from samtools/pileup < samtools/mpileup < SNVMix? Is there a problem with these SNP counts? Thanks.

Try plotting the SNP call quality histograms. It may also help to take a quick look at Ti/Tv ratios. Reference: Ti/Tv Ratio Confirms Snp Discovery. Is This A General Rule?
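As a rough illustration of both checks, here is a minimal Python sketch. It assumes you have already parsed each caller's output into plain (ref, alt) base pairs and a list of quality scores; the function names and input shapes are hypothetical and not part of samtools or SNVMix. Since each base has one possible transition and two possible transversions, purely random miscalls would push Ti/Tv toward 0.5, so a call set with a noticeably lower ratio than the others may contain more false positives.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Pairs that are not single-base A/C/G/T substitutions are skipped.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a simple substitution
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ti += 1  # A<->G or C<->T
        else:
            tv += 1  # purine <-> pyrimidine
    return ti / tv if tv else float("inf")


def quality_histogram(qualities, bin_width=10):
    """Count calls per quality bin; each key is the lower edge of its bin."""
    return Counter(int(q // bin_width) * bin_width for q in qualities)
```

Running both functions on each of the three call sets and comparing the results side by side should show whether the extra calls are concentrated at low quality.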

Generally, SNP callers work in two passes: the first pass identifies every position with at least one mismatching base, and the second pass filters these results to generate the list of SNPs you think are "real".

I get the sense that you have only done the first pass.

You need to decide how to set the parameters for the second pass appropriately depending on how you want to balance sensitivity vs specificity.
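One way to explore that balance is to sweep the cutoff and watch how the call set changes. The sketch below is only an illustration: it assumes you have (position, score) pairs for your raw calls and some set of positions you trust to be real (for example, known variants); both inputs and the function name are hypothetical. It reports precision rather than specificity, because the true negatives would be every other position in the genome.

```python
def sweep_thresholds(calls, trusted, thresholds):
    """Summarise the call set at each score cutoff.

    calls:      list of (position, score) tuples from the first pass
    trusted:    set of positions assumed to be real variants (hypothetical input)
    thresholds: iterable of score cutoffs to try

    Returns a list of (threshold, n_kept, sensitivity, precision) tuples.
    """
    rows = []
    for t in thresholds:
        kept = {pos for pos, score in calls if score >= t}
        true_pos = len(kept & trusted)
        sensitivity = true_pos / len(trusted) if trusted else 0.0
        precision = true_pos / len(kept) if kept else 0.0
        rows.append((t, len(kept), sensitivity, precision))
    return rows
```

Raising the cutoff typically lowers sensitivity and raises precision; the point where the trade-off is acceptable depends on what the calls will be used for.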

SNVMix results are filtered using -t to set the probability threshold. The resulting file only contains positions where p(bb)+p(ab)>=T for your specified value of T.
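The rule itself is simple enough to sketch in a few lines of Python. This is not SNVMix code and does not read SNVMix output; it assumes each call has already been parsed into a dict with hypothetical keys "p_ab" and "p_bb" holding the heterozygous and homozygous-variant probabilities.

```python
def passes_threshold(p_ab, p_bb, threshold):
    """Apply the rule p(bb) + p(ab) >= T to one position."""
    return (p_ab + p_bb) >= threshold


def filter_calls(calls, threshold):
    """Keep only calls whose combined variant probability reaches the threshold.

    calls: list of dicts with hypothetical keys "p_ab" and "p_bb".
    """
    return [c for c in calls if passes_threshold(c["p_ab"], c["p_bb"], threshold)]
```

Counting `len(filter_calls(calls, T))` for a few values of T shows how strongly the reported SNP number depends on the threshold.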

samtools pileup results are filtered with varFilter, using a wide variety of filtering options including the Phred-scaled SNP quality, depth, distance to the closest indel, and more.

samtools mpileup results are filtered with bcftools/varFilter, which again offers a wide variety of filtering options.
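To make the kinds of filters mentioned above concrete, here is a small Python sketch that applies quality, depth, and indel-distance cutoffs to already-parsed calls. It does not reproduce varFilter's behaviour or defaults; the record keys ("pos", "qual", "depth"), the default cutoffs, and the single-chromosome assumption are all illustrative.

```python
import bisect


def nearest_distance(sorted_positions, pos):
    """Distance from pos to the closest entry in a sorted list (inf if empty)."""
    i = bisect.bisect_left(sorted_positions, pos)
    neighbours = sorted_positions[max(i - 1, 0):i + 1]
    return min((abs(pos - p) for p in neighbours), default=float("inf"))


def filter_snps(snps, indel_positions, min_qual=20, min_depth=3,
                max_depth=100, min_indel_dist=10):
    """Keep SNPs that pass quality, depth, and indel-distance cutoffs.

    snps: list of dicts with hypothetical keys "pos", "qual", "depth",
          all on one chromosome.
    indel_positions: positions of called indels on the same chromosome.
    The default cutoffs are placeholders, not recommended values.
    """
    indels = sorted(indel_positions)
    kept = []
    for snp in snps:
        if snp["qual"] < min_qual:
            continue  # low SNP quality
        if not (min_depth <= snp["depth"] <= max_depth):
            continue  # too shallow, or suspiciously deep (e.g. repeats)
        if nearest_distance(indels, snp["pos"]) < min_indel_dist:
            continue  # too close to an indel
        kept.append(snp)
    return kept
```

With 30X coverage, the maximum-depth cutoff in particular would need to be chosen with the actual depth distribution in mind.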

What is your sequencing coverage? Did you do any quality filtering on the SNPs to arrive at the numbers posted here? Generally, I think this is a sensitivity vs specificity issue.

Thanks. The sequencing coverage is at least 30X. I did quality filtering before calling SNPs, such as removing repeats, duplicates, and reads that fail the platform/vendor quality check. Do you have any suggestions on how to choose the threshold for the sensitivity vs specificity trade-off?

