Question: Samtools Mpileup Vs Pileup: Multiple Samples And GT=0/0
8.9 years ago, Pierre Lindenbaum (133k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote:

Better late than never:

1) Is there a difference between:

samtools mpileup (options) sample1.bam sample2.bam sample3.bam

and

samtools mpileup (options) sample1.bam
samtools mpileup (options) sample2.bam
samtools mpileup (options) sample3.bam

I mean, when processing one BAM, does mpileup use the reads from the other samples in its calculations?

2) In mpileup output, what's the best way to test whether a sample does not carry a SNP? Is looking at "GT=0/0" enough? Here is one record, transposed so that each column is on its own line:

$1   #CHROM       chr1
$2   POS          9997
$3   ID           .
$4   REF          N
$5   ALT          A
$6   QUAL         3.55
$7   FILTER       .
$8   INFO         DP=1;AF1=1;AC1=10;DP4=0,0,1,0;MQ=60;FQ=-25.5
$9   FORMAT       GT:PL:DP:GQ
$10  sample1.bam  0/1:31,3,0:1:5
$11  sample2.bam  0/0:0,0,0:0:3
$12  sample3.bam  0/0:0,0,0:0:3
$13  sample4.bam  0/0:0,0,0:0:3
$14  sample5.bam  0/0:0,0,0:0:3

1) yes. 2) also set a threshold on GQ.

Reply written 8.9 years ago by lh3 (32k)
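
To make the GQ advice concrete, here is a minimal Python sketch of a per-sample check. It assumes the FORMAT string contains GT, DP and GQ, as in the record above. The function name and the default thresholds (min_gq=20, min_dp=1) are hypothetical choices for illustration, not values recommended by samtools, so tune them for your data. The idea is that "0/0" with DP=0 and PL=0,0,0 only means "no reads here", not "confidently reference".

```python
def is_confident_hom_ref(format_str, sample_str, min_gq=20, min_dp=1):
    """Return True only if the sample is 0/0 AND that call is backed by data.

    min_gq and min_dp are illustrative thresholds (assumptions), not
    samtools defaults.
    """
    # Pair each FORMAT key (GT, PL, DP, GQ, ...) with the sample's value.
    fields = dict(zip(format_str.split(":"), sample_str.split(":")))

    # Treat phased and unphased genotypes the same way.
    gt = fields.get("GT", "./.").replace("|", "/")
    if gt != "0/0":
        return False

    # A missing or "." value for DP/GQ is treated as "no evidence".
    try:
        dp = int(fields.get("DP", "."))
        gq = float(fields.get("GQ", "."))
    except ValueError:
        return False

    return dp >= min_dp and gq >= min_gq


if __name__ == "__main__":
    fmt = "GT:PL:DP:GQ"
    # Sample columns copied from the record in the question.
    samples = {
        "sample1.bam": "0/1:31,3,0:1:5",
        "sample2.bam": "0/0:0,0,0:0:3",
    }
    for name, value in samples.items():
        print(name, is_confident_hom_ref(fmt, value))
```

With these thresholds, sample2.bam is not accepted as a confident non-carrier: its genotype is 0/0, but it has no covering reads (DP=0) and a low GQ of 3.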
7.9 years ago, Christof Winter (990), Lund, Sweden, wrote:

Quoting from

1. Between single- and multi-sample variant calling, which is preferred?

By using multi-sample calling, we gain power on SNPs shared between samples, but lose power on singleton SNPs. Here is a way of thinking of this. Suppose we have 1% false positive rate (FPR) for variant calling from one sample. If we call SNPs from 100 samples separately and then combine the calls, the FPR would be around 10-20% (not 100% because more SNPs are found given 100 samples). To retain an acceptable FPR on singletons, we have to be more stringent on each sample and thus lose power. Combining single-sample calls naively would not increase power on shared SNPs. This is where multi-sample calling does better: by taking the advantage of correlation between samples, we are able to call a SNP if it appears in multiple samples, but too weak to call in each sample individually. Joint calling is particularly preferable if we have multiple low-coverage samples for which single-sample calling does not work well. It is also able to reveal some artifacts only detectable with many samples.

In all, if you have deep coverage and need to study each sample separately, you should use single-sample calling. If you have low-coverage data or only care about variants from multiple samples as a whole, you should use multi-sample calling. Understanding the difference between single- and multi-sample calling also helps experimental design: if you only want to get a set of SNPs from many samples or to do association studies, sequencing to deep coverage is a waste. You pay much more only to get marginal reward.
