Samtools Mpileup Vs Pileup: Multiple Samples And Gt=0/0
12.4 years ago

Better late than never:

1) Is there a difference between:

samtools mpileup (options) sample1.bam sample2.bam  sample3.bam


and

samtools mpileup (options) sample1.bam
samtools mpileup (options) sample2.bam
samtools mpileup (options) sample3.bam


I mean, for a given BAM, does mpileup use the reads from the other samples in its calculations?

2) In mpileup output, what is the best way to test whether a sample does not carry a SNP? Is checking for GT "0/0" enough? Here is one record, transposed to one field per line:

$1   #CHROM       chr1
$2   POS          9997
$3   ID           .
$4   REF          N
$5   ALT          A
$6   QUAL         3.55
$7   FILTER       .
$8   INFO         DP=1;AF1=1;AC1=10;DP4=0,0,1,0;MQ=60;FQ=-25.5
$9   FORMAT       GT:PL:DP:GQ
$10  sample1.bam  0/1:31,3,0:1:5
$11  sample2.bam  0/0:0,0,0:0:3
$12  sample3.bam  0/0:0,0,0:0:3
$13  sample4.bam  0/0:0,0,0:0:3
$14  sample5.bam  0/0:0,0,0:0:3


1) Yes. 2) Also set a threshold on GQ.
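To make the GQ suggestion concrete, here is a minimal sketch of such a check in plain Python. The function names and the default thresholds (`min_gq=20`, `min_dp=1`) are illustrative assumptions, not values recommended by samtools; pick cutoffs that suit your data. The sketch also requires a minimum depth, because in the record from the question samples 2 to 5 are reported as 0/0 with DP=0 and PL=0,0,0, which means "no reads here" rather than "confidently reference".

```python
def parse_sample(format_field, sample_field):
    """Pair the keys of the FORMAT column with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def is_confident_hom_ref(format_field, sample_field, min_gq=20, min_dp=1):
    """Return True only if GT is 0/0 and both GQ and DP clear the thresholds.

    min_gq and min_dp are hypothetical defaults chosen for illustration.
    """
    fields = parse_sample(format_field, sample_field)
    if fields.get("GT") not in ("0/0", "0|0"):
        return False
    try:
        gq = float(fields["GQ"])
        dp = int(fields["DP"])
    except (KeyError, ValueError):
        return False  # missing or "." values: treat as not confident
    return gq >= min_gq and dp >= min_dp


# FORMAT and two sample columns copied from the record in the question.
fmt = "GT:PL:DP:GQ"
samples = {
    "sample1.bam": "0/1:31,3,0:1:5",
    "sample2.bam": "0/0:0,0,0:0:3",
}
results = {name: is_confident_hom_ref(fmt, value) for name, value in samples.items()}
print(results)
```

With these thresholds neither sample counts as a confident non-carrier: sample1 is called 0/1, and sample2 is 0/0 only by default, since it has no coverage and a GQ of 3.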

11.4 years ago
Christof Winter ★ 1.0k

Quoting from https://github.com/samtools/samtools/wiki/FAQ:

1. Between single- and multi-sample variant calling, which is preferred?

By using multi-sample calling, we gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for variant calling from one sample. If we call SNPs from 100 samples separately and then combine the calls, the FPR would be around 10-20% (not 100% because more SNPs are found given 100 samples). To retain an acceptable FPR on singletons, we have to be more stringent on each sample and thus lose power. Combining single-sample calls naively would not increase power on shared SNPs. This is where multi-sample calling does better: by taking the advantage of correlation between samples, we are able to call a SNP if it appears in multiple samples, but too weak to call in each sample individually. Joint calling is particularly preferable if we have multiple low-coverage samples for which single-sample calling does not work well. It is also able to reveal some artifacts only detectable with many samples.

In all, if you have deep coverage and need to study each sample separately, you should use single-sample calling. If you have low-coverage data or only care about variants from multiple samples as a whole, you should use multi-sample calling. Understanding the difference between single- and multi-sample calling also helps experimental design: if you only want to get a set of SNPs from many samples or to do association studies, sequencing to deep coverage is a waste. You pay much more only to get marginal reward.
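The FAQ's jump from a 1% per-sample FPR to "around 10-20%" after combining 100 samples can be reproduced with a back-of-the-envelope calculation. This is only a sketch under assumptions that are mine, not the FAQ's: false positives are taken to fall at different sites in each sample, so they add up across samples, while true SNPs are largely shared, so the union of true sites is only `true_site_growth` times the size of one sample's true set. Both `combined_fpr` and `true_site_growth` are hypothetical names introduced for illustration.

```python
def combined_fpr(per_sample_fpr, n_samples, true_site_growth):
    """Approximate FPR of the union of independent single-sample call sets.

    Call counts are expressed relative to one sample's call set (= 1.0).
    Assumption: false positives never coincide between samples, so they sum.
    Assumption: true sites in the union = true_site_growth * one sample's.
    """
    false_calls = n_samples * per_sample_fpr
    true_calls = true_site_growth * (1.0 - per_sample_fpr)
    return false_calls / (false_calls + true_calls)


# Hypothetical growth factors for the number of true sites across 100 samples.
for growth in (4, 6, 9):
    fpr = combined_fpr(per_sample_fpr=0.01, n_samples=100, true_site_growth=growth)
    print(f"true sites x{growth}: combined FPR = {fpr:.1%}")
```

Under these assumptions the combined FPR lands in the 10-20% range when 100 samples together contain roughly four to nine times as many true SNP sites as a single sample. It would reach 100% only if no new true sites were found at all, which is the point of the FAQ's parenthetical remark.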