Regarding vcf output
4.4 years ago
DL ▴ 40

Hello everyone,

I have a question about VCF output. The output looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1008

Chr01   9484    .       G       A       1006.77 .       AC=2;AF=1.00;AN=2;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=52.12;QD=26.83;SOR=0.874        GT:AD:DP:GQ:PL  1/1:0,22:22:69:1035,69,0

Chr01   9488    .       C       G       1051.77 .       AC=2;AF=1.00;AN=2;DP=33;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.88;QD=30.42;SOR=0.963        GT:AD:DP:GQ:PL  1/1:0,23:23:72:1080,72,0

Chr01   9505    .       G       C       1051.77 .       AC=2;AF=1.00;AN=2;DP=35;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.67;QD=24.50;SOR=0.859        GT:AD:DP:GQ:PL  1/1:0,24:24:72:1080,72,0


For the last few days I have been reading about the VCF format and how to filter SNPs using different flags. Most papers filter SNPs on the basis of DP and QUAL values, but now I am a little confused. In the VCF file, two DP values are defined: one is the sample depth and the other is the allelic depth. When I set the parameter DP>10, I got output like the one shown above. My question is: should I filter SNPs on the basis of allelic depth, since the reads counted in the allelic depth are filtered and informative, as mentioned in the GATK documentation? I do not understand this concept. Can anyone give me some suggestions?

Thanks & Regards

Deepika .

SNP next-gen snp
0

You might also consider filtering on QD rather than QUAL and DP: https://software.broadinstitute.org/gatk/documentation/article.php?id=6925

0

Thank you, but my question is still the same.

2
4.4 years ago
aays ▴ 170

A good answer to this can be found here. Briefly, DP refers to the total filtered depth of reads regardless of which allele was called at a given read, while the AD field returns the number of reads that support each of the alleles reported in the VCF. This is why AD is two values in your example (0, 23 for the second record, which means 0 REF reads and 23 ALT reads) and DP is a single value.

With regard to the two DP values per record in your VCF: notice that one is in the INFO column and the other is in the sample column described by FORMAT. The DP and AD fields discussed above are both FORMAT fields, which hold sample-wise values. INFO fields are often custom to each VCF, but there is likely a reference for what DP in the INFO field actually represents in the VCF header at the top of your file. The line should start somewhat like this:

##INFO=<ID=DP [etc]


and ought to contain a description of what that DP parameter represents as opposed to the FORMAT one.
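
As a rough sketch of how to look this up, the snippet below scans header meta-lines for the INFO and FORMAT definitions of DP and prints their descriptions. The header excerpt is only illustrative (the exact wording depends on the caller that produced your VCF), and `describe_field` is a hypothetical helper name, not part of any library.

```python
import re

# Illustrative header excerpt; the wording in a real VCF depends on the variant caller.
HEADER = '''##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1008
'''


def describe_field(header_lines, field_id):
    """Return {column: description} for each ##INFO/##FORMAT line that defines field_id."""
    pattern = re.compile(
        r'^##(INFO|FORMAT)=<ID=' + re.escape(field_id) + r',.*Description="([^"]*)"'
    )
    found = {}
    for line in header_lines:
        if not line.startswith("##"):
            break  # meta-lines end at the #CHROM column header
        match = pattern.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


descriptions = describe_field(HEADER.splitlines(), "DP")
for column, text in sorted(descriptions.items()):
    print(column, "DP:", text)
```

On a real file you would pass the lines of the VCF itself instead of the `HEADER` string.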

0

Thanks for your response, but I still do not understand, sorry. In the output AC=2;AF=1.00;AN=2;DP=31 GT:AD:DP:GQ:PL 1/1:0,22:22:69:1035,69,0

there are two different values for DP: DP=31 and DP=22. Why does it show different DP values for the same SNP?

Thanks

0

Right, again - AC=2;AF=1.00;AN=2;DP=31 falls under the INFO column, and a guideline for what that DP means is likely found in your VCF header at the top of the file.

GT:AD:DP:GQ:PL 1/1:0,22:22:69:1035,69,0, however, falls under FORMAT. The FORMAT DP parameter is more clearly defined, and refers to the filtered read depth for that sample at the site, regardless of which allele each read supports (the per-allele read counts are in AD).

Although both of these things are notated as DP, they have different meanings because of the columns they're in. The FORMAT DP always means what I mentioned above, in that a DP value is specified for each sample in the VCF.

The INFO DP, on the other hand, is more likely a summary statistic of all the samples at that site. However, I'm not certain what its exact meaning is without knowing how your VCF was called.

I think this is a helpful link for looking into the issue more.
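
To make the column distinction concrete, here is a minimal sketch in plain Python that splits the first record from the question into its columns and pulls out both DP values along with AD. It assumes the columns are tab-separated, as in a standard VCF; `parse_depths` is a hypothetical helper name.

```python
# First record from the question, written with tab-separated columns.
RECORD = (
    "Chr01\t9484\t.\tG\tA\t1006.77\t.\t"
    "AC=2;AF=1.00;AN=2;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;"
    "MQ=52.12;QD=26.83;SOR=0.874\t"
    "GT:AD:DP:GQ:PL\t1/1:0,22:22:69:1035,69,0"
)


def parse_depths(record_line):
    """Extract the site-level INFO DP and the first sample's FORMAT DP and AD."""
    cols = record_line.rstrip("\n").split("\t")
    info_col, format_col, sample_col = cols[7], cols[8], cols[9]
    # INFO is a ';'-separated list of KEY=VALUE pairs (flag entries have no '=').
    info = dict(item.split("=", 1) for item in info_col.split(";") if "=" in item)
    # FORMAT names the ':'-separated sub-fields of each sample column.
    sample = dict(zip(format_col.split(":"), sample_col.split(":")))
    return {
        "info_dp": int(info["DP"]),
        "format_dp": int(sample["DP"]),
        "ad": [int(x) for x in sample["AD"].split(",")],
    }


depths = parse_depths(RECORD)
print(depths)  # {'info_dp': 31, 'format_dp': 22, 'ad': [0, 22]}
```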

0

Thank you so much for your quick response; that cleared this up for me. I have another question. When I filter SNPs with DP > 10, the filter uses the INFO DP value, not the FORMAT DP value. I want to filter SNPs on the FORMAT DP value because I am using only one sample. I am new to this field, so it is taking me time to understand these things.

Again, thank you for the explanation.

1

Glad I could help! And no worries about taking time to understand things - bioinformatics can certainly have a bit of a steep learning curve.

Can I ask what you're using to filter by DP value? The Python package pyVCF, which I use myself and would recommend, allows you to pull up the DP value for each given sample at a site. One could then use standard Python functionality to filter for DP > 10 and use the vcf.Writer functionality to write those records to a new VCF if desired.
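
For reference, here is a minimal sketch of that kind of per-sample filter. It deliberately uses only the Python standard library rather than PyVCF, so it does not show the PyVCF API; the function names are hypothetical, and the second record is made up to show a site whose INFO DP passes the threshold while its FORMAT DP does not.

```python
def sample_dp(record_line, sample_index=0):
    """Return the FORMAT DP of one sample as an int, or None if it is missing."""
    cols = record_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")
    values = cols[9 + sample_index].split(":")
    value = dict(zip(keys, values)).get("DP", ".")
    return int(value) if value.isdigit() else None


def filter_by_sample_dp(lines, min_dp=10, sample_index=0):
    """Yield header lines unchanged, plus records whose FORMAT DP exceeds min_dp."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        dp = sample_dp(line, sample_index)
        if dp is not None and dp > min_dp:
            yield line


# Toy input: the first record is abbreviated from the question,
# the second record is invented (INFO DP=15, FORMAT DP=8).
VCF_LINES = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1008",
    "Chr01\t9484\t.\tG\tA\t1006.77\t.\tDP=31\tGT:AD:DP:GQ:PL\t1/1:0,22:22:69:1035,69,0",
    "Chr01\t9999\t.\tT\tC\t80.00\t.\tDP=15\tGT:AD:DP:GQ:PL\t1/1:0,8:8:24:100,24,0",
]

kept = list(filter_by_sample_dp(VCF_LINES))
for line in kept:
    print(line)
```

Only the record with a sample-level DP of 22 survives. On a real file, the same generator could be fed an open file handle and its output written line by line to a new VCF.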

0

Hi, yes, I am using vcffilter to filter the SNPs. I also saw the Python package pyVCF. Can you tell me how you filter SNPs with it? Could you please share the commands you use with this package?

Thanks & Regards Deepika