What are the properties used for filtering VCF files?
10.0 years ago
stat.1405 ▴ 30

I have a filtered file; it was filtered based on low quality (< 11), IndelGap, and SnpGap.

First, why was a threshold of 11 chosen?

Second, what do SnpGap and IndelGap mean?

Finally, is there any way, or any evidence, to tell whether the data should be filtered further or has already been filtered enough?

I am a statistician and need to understand these kinds of things. If there is a paper or book that could help me learn more about VCF tools and the VCF format, that would be helpful.

Thanks.

next-gen snp vcf R • 4.4k views
10.0 years ago
  1. Indel gap, in the context of filtering, refers to the minimum distance between an indel and a SNP. Indels can cause mapping artifacts and may generate false-positive SNPs nearby, so SNPs that lie in the vicinity of an indel are filtered out. I use a 20 bp threshold, so my filtered VCF file contains no SNPs within 20 bp of an indel.
  2. SNP gap is a similar concept. If you see a cluster of SNPs within a short window, it is highly likely that all of them are false positives. For example, if a 20 bp region contains 3 or more SNPs, they are most likely false positives.
  3. I don't know which threshold you are talking about. Is it the variant quality score or the minimum base quality score? Different variant-calling tools, such as GATK and Samtools, produce different ranges of variant quality scores, and people use different cutoffs for different tools. For example, GATK suggests using a variant quality score of 30 or more.
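To make the three filters above concrete, here is a minimal Python sketch. It is not how any particular tool implements them. The `Variant` record, the function names, and the default values are all hypothetical, chosen to mirror the numbers in this answer (20 bp indel gap, 3 or more SNPs in 20 bp, quality of 30). It also assumes the quality threshold refers to the Phred-scaled variant quality, where Q = -10 log10(P of error), and it treats an indel as a single position, ignoring its length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a real VCF parser)."""
    chrom: str
    pos: int      # 1-based position
    kind: str     # "snp" or "indel"
    qual: float   # assumed to be a Phred-scaled variant quality


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to an error probability."""
    return 10 ** (-q / 10)


def clustered_snp_keys(snps, window, min_cluster):
    """Return (chrom, pos) of SNPs that belong to a group of at least
    `min_cluster` SNPs spanning fewer than `window` bp."""
    by_chrom = {}
    for s in snps:
        by_chrom.setdefault(s.chrom, []).append(s.pos)
    flagged = set()
    for chrom, positions in by_chrom.items():
        positions.sort()
        for i in range(len(positions) - min_cluster + 1):
            j = i + min_cluster - 1
            if positions[j] - positions[i] < window:
                for p in positions[i:j + 1]:
                    flagged.add((chrom, p))
    return flagged


def filter_variants(variants, min_qual=30.0, indel_gap=20,
                    snp_window=20, min_cluster=3):
    """Keep SNPs that pass the quality, indel-gap, and SNP-cluster filters."""
    indels = [v for v in variants if v.kind == "indel"]
    snps = [v for v in variants if v.kind == "snp"]
    clustered = clustered_snp_keys(snps, snp_window, min_cluster)
    kept = []
    for s in snps:
        if s.qual < min_qual:
            continue  # low variant quality
        if any(i.chrom == s.chrom and abs(i.pos - s.pos) <= indel_gap
               for i in indels):
            continue  # too close to an indel
        if (s.chrom, s.pos) in clustered:
            continue  # part of a dense SNP cluster
        kept.append(s)
    return kept


# Made-up example data, for illustration only.
variants = [
    Variant("chr1", 100, "snp", 50.0),    # 10 bp from the indel below
    Variant("chr1", 110, "indel", 50.0),
    Variant("chr1", 200, "snp", 50.0),    # isolated, good quality
    Variant("chr1", 300, "snp", 50.0),    # three SNPs within 20 bp
    Variant("chr1", 305, "snp", 50.0),
    Variant("chr1", 310, "snp", 50.0),
    Variant("chr1", 400, "snp", 10.0),    # low quality
    Variant("chr2", 105, "snp", 60.0),    # different chromosome from the indel
]
kept = filter_variants(variants)
print([(v.chrom, v.pos) for v in kept])
```

Under the Phred assumption, a score of 30 corresponds to an error probability of 0.001 and a score of 11 to roughly 0.08, which is one way to compare cutoffs across tools.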