Question: Dp And Dp4 In Vcf
1
gravatar for Juliofdiaz
7.5 years ago by
Juliofdiaz130
Toronto, Ontario, Canada
Juliofdiaz130 wrote:

I'm confused about some numbers I get in a VCF file. This is an example:

gi|218888746|ref|NC_011770.1| 214 . G C 202 . DP=45;AF1=1;AC1=2;DP4=0,0,25,13;MQ=55;FQ=-141 GT:PL:GQ 1/1:235,114,0:99

The raw depth coverage at this positon is 53, as seen in samtools depth and confirmed in IGV, and DP=45, so I guess that 8 of the 53 reads are below the sequencing quality threshold set by samtools, which I think is q20 (Please correct me if this is wrong), My actual confusion comes when I look at the DP4 numbers which totals 38, so if DP is already quality filtered, shouldnt DP equal the sum of DP4 values? Otherwise, what is the different the "high-quality" filtering to get to the DP4 values and the filtering to get to the DP value. Thanks

samtools • 8.9k views
ADD COMMENTlink modified 3.8 years ago by Shicheng Guo7.9k • written 7.5 years ago by Juliofdiaz130
3
gravatar for Shicheng Guo
3.8 years ago by
Shicheng Guo7.9k
Shicheng Guo7.9k wrote:

OK. Now the question is totally clear. samtools mpileup conducted too much filtering, therefore, the depth from the mpileup usually far less than samtools depth

1, mapping qulity (samtools mpileup -q)

2, base qulity (samtools mpileup -Q)

3, discard anomalous read pairs (samtools mpileup -A)

After remove these filtering, the number of depth from these two method will be exactly same. 

 

samtools depth <SORTEDBAM>
samtools mpileup -A -q 1 -Q 1 <Reference> <SORTEDBAM>

 

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Shicheng Guo7.9k
2
gravatar for Zev.Kronenberg
7.5 years ago by
United States
Zev.Kronenberg11k wrote:

From samtools:

link

DP = The number of reads covering or bridging POS.

DP4 = Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.

I don't know why you have a discrepancy in what samtools depth is telling you and what mpileup is telling you. Are you using data generated from the exact same BAM?

Did you set a quality threshold when you did samtools depth?

ADD COMMENTlink modified 3 days ago by RamRS25k • written 7.5 years ago by Zev.Kronenberg11k

Yes, I am using the exact sameBAM. This is the command I use for depth

samtools depth <SORTEDBAM>

And this is the command I use for the VCF

samtools mpileup -uf <REFFILE> <SORTEDBAM>
ADD REPLYlink modified 3 days ago by RamRS25k • written 7.5 years ago by Juliofdiaz130

Besides trying to gain general knowledge, what are goal of your project? Are you just trying to call SNPs? I ask because maybe there is a tool biostar could recommend.

ADD REPLYlink written 7.5 years ago by Zev.Kronenberg11k

Yes, I am trying to call high quality SNPs from a bacterial genome. I am trying to do this because I want to do GATK's quality recalibration, and you need a set of known SNPs (there is no SNP database for my organism of interest). Besides trying to fully understand the data that is coming out of SAMTOOLs. Thanks

ADD REPLYlink written 7.5 years ago by Juliofdiaz130

Will you be using the same dataset to generate the list of SNPs and then use GATK to go back over and call them?

ADD REPLYlink written 7.5 years ago by Zev.Kronenberg11k

I have about 30 genomes of the same strain so I will pool the high quality SNPS from those, and select only the ones that are present in all genomes. Besides I will include some of the known SNPs (kind of manually). This may introduce a small bias but to recalibrate the alignment quality scores it shouldnt be too bad.

ADD REPLYlink written 7.5 years ago by Juliofdiaz130

Samtools can take all 30 genomes simultaneously to help it make good calls for low coverage regions. You could just give samtools all the BAMs at once. Of course I am biased since I only use Samtools for variant calling. I wouldn't use the same genomes to inform GATK.

ADD REPLYlink written 7.5 years ago by Zev.Kronenberg11k

At this point Im not too worried about calling low coverage regions because I will call those for each genome using GATK;s SNP calling guidelines. At this point I just want to call the SNPs that are highly reliable. You are right, using SNPs from the same genomes is not the best approach but the lack of a dbSNP for my organism leaves little choice. My post was more directed towards understanding why my depth is different than my DP, which is different than the sum of DP4 values.

ADD REPLYlink written 7.5 years ago by Juliofdiaz130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 829 users visited in the last hour