How to select SNPs in the most conservative way after WGS variant calling?
5.2 years ago
serpalma.v ▴ 80

Hello!

I have made a raw (unfiltered) variant call set following GATK best practices (a VCF file with ~16 million SNPs produced by GenotypeGVCFs). The original WGS data corresponds to 60 samples sequenced at an average coverage of 20x.

We want to identify a small subset of really good SNPs and another subset of really bad SNPs, which we could use for validation.

How can I construct filters that keep the SNPs most likely to be true positives and the SNPs most likely to be false positives, respectively?

A first choice would be to rank by QUAL and pick the SNPs at the top and the bottom of the list, but I am sure there is a more sophisticated way to do this.
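As a point of reference, the QUAL-ranking baseline described above could be sketched roughly as follows. This is only an illustration in plain Python: the function names are hypothetical, it assumes an uncompressed, tab-delimited VCF read line by line, and it skips records whose QUAL is missing (".").

```python
def parse_qual(line):
    """Return (chrom, pos, qual) for a VCF data line, or None if QUAL is '.'."""
    fields = line.rstrip("\n").split("\t")
    if fields[5] == ".":
        return None
    return fields[0], int(fields[1]), float(fields[5])


def top_and_bottom_by_qual(vcf_lines, n):
    """Return the n highest-QUAL and n lowest-QUAL records.

    Both lists are ordered from higher to lower QUAL. Header lines
    (starting with '#') and records without a QUAL value are ignored.
    """
    if n <= 0:
        return [], []
    records = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        rec = parse_qual(line)
        if rec is not None:
            records.append(rec)
    records.sort(key=lambda r: r[2], reverse=True)
    return records[:n], records[-n:]
```

With ~16 million records this keeps every parsed tuple in memory, so for a real call set a streaming or heap-based variant may be preferable.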

Also, since the VCF contains multiple samples, would it be better to filter by site or by genotype?

Thanks and I appreciate your feedback!

SNP sequencing next-gen • 1.1k views
5.2 years ago

You could make use of depth and allele frequencies as well. The more samples you have, the more difficult it is to understand how the QUAL field was computed and what weight it assigns to the data.

In addition, you could run a second SNP caller and treat the SNPs identified by both callers as more "credible".
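The two-caller idea can be sketched as a set intersection on (CHROM, POS, REF, ALT). This is a minimal illustration with hypothetical function names; it assumes both VCFs were called against the same reference and represent variants the same way (for example, multiallelic sites split or joined consistently), otherwise identical SNPs may fail to match.

```python
def site_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def concordant_and_private(calls_a, calls_b):
    """Return (sites in both call sets, sites only in A, sites only in B)."""
    a = site_keys(calls_a)
    b = site_keys(calls_b)
    return a & b, a - b, b - a
```

Sites found by both callers would be candidates for the "good" subset, while sites private to one caller might be a starting point for the "bad" subset.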


Thanks Istvan

OK, I will call variants with SAMtools/BCFtools on the same BAMs as well. Then I can subset both raw call sets by depth and allele frequency, and treat the SNPs in the intersection as the good ones.

To filter by depth, I guess I could keep only the SNPs where every sample has a depth >= 30x, as per this white paper.
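A site-level depth check of this kind might look like the sketch below. It assumes each sample carries a per-sample DP entry in the FORMAT column (as GATK output normally does); the function name and the `min_dp` parameter are hypothetical, and samples with a missing DP are treated as failing.

```python
def all_samples_have_depth(line, min_dp):
    """True if every sample on this VCF data line has FORMAT/DP >= min_dp."""
    fields = line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    if "DP" not in fmt:
        return False
    dp_idx = fmt.index("DP")
    for sample in fields[9:]:
        parts = sample.split(":")
        # VCF allows trailing sub-fields to be dropped, so DP may be absent.
        if dp_idx >= len(parts) or parts[dp_idx] in (".", ""):
            return False
        if int(parts[dp_idx]) < min_dp:
            return False
    return True
```

Keeping the threshold as a parameter makes it easy to see how many sites survive at different cutoffs before settling on one.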

To filter by allele frequency (provided that the SNPs have the required depth), I was thinking of keeping SNPs where all homozygous samples have an allele frequency of 1 and all heterozygous samples have an allele frequency of 0.5, as stated in this review.
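The allele-frequency check could be sketched per sample from the GT and AD fields, as below. Several things here are assumptions rather than anything stated in the thread: the SNP is biallelic, AD holds "ref,alt" read counts, the fraction of alt reads is used as the per-sample allele frequency, and a tolerance `tol` is allowed because observed read fractions rarely equal exactly 0.5 or 1. All names are hypothetical.

```python
# Expected fraction of alt reads for each genotype class (biallelic sites).
EXPECTED = {"hom_ref": 0.0, "het": 0.5, "hom_alt": 1.0}


def alt_fraction(ad_field):
    """Fraction of reads supporting ALT from an AD value like '12,9'."""
    counts = ad_field.split(",")
    if len(counts) != 2 or "." in counts:
        return None
    ref, alt = int(counts[0]), int(counts[1])
    total = ref + alt
    return alt / total if total > 0 else None


def genotype_class(gt):
    """Classify a diploid GT such as '0/1' or '1|1'; None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] != alleles[1]:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def site_passes_allele_balance(line, tol=0.1):
    """True if every sample's alt-read fraction is within tol of the
    value expected for its genotype (0, 0.5 or 1)."""
    fields = line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    if "GT" not in fmt or "AD" not in fmt:
        return False
    gt_i, ad_i = fmt.index("GT"), fmt.index("AD")
    for sample in fields[9:]:
        parts = sample.split(":")
        if max(gt_i, ad_i) >= len(parts):
            return False
        cls = genotype_class(parts[gt_i])
        frac = alt_fraction(parts[ad_i])
        if cls is None or frac is None:
            return False
        if abs(frac - EXPECTED[cls]) > tol:
            return False
    return True
```

Sites where one or more samples fall far outside the tolerance could, conversely, be candidates for the "bad" subset.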

Did I get this right?
