9.2 years ago
dadoudou
I am resequencing 150 populations. After calling SNPs and indels with GATK and filtering them, I found that 20% of the SNPs/indels lie less than 20 bp from another variant. I don't know whether this is reasonable. If it is not, what should I do next?
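To make the "20% within 20 bp" figure reproducible, here is a minimal Python sketch of how such a fraction could be computed. It assumes the variants have already been reduced to (chromosome, position) pairs; the function name `fraction_clustered` and the input layout are hypothetical and not part of GATK or any other tool.

```python
from collections import defaultdict


def fraction_clustered(variants, window=20):
    """Return the fraction of variants that have another variant
    less than `window` bp away on the same chromosome.

    `variants` is an iterable of (chrom, pos) tuples (assumed layout).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in variants:
        by_chrom[chrom].append(pos)

    total = 0
    clustered = 0
    for positions in by_chrom.values():
        positions.sort()
        n = len(positions)
        total += n
        for i, pos in enumerate(positions):
            # After sorting, only the immediate neighbours need checking.
            near_prev = i > 0 and pos - positions[i - 1] < window
            near_next = i < n - 1 and positions[i + 1] - pos < window
            if near_prev or near_next:
                clustered += 1
    return clustered / total if total else 0.0


example = [("chr1", 100), ("chr1", 110), ("chr1", 500), ("chr2", 105)]
print(fraction_clustered(example, window=20))
```

In the toy example only the two chr1 variants at 100 and 110 are close to each other; the chr2 variant is not counted because it is on a different chromosome.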
I also thought about using "vcftools --thin" to thin the SNPs/indels, but that seems too simple and crude.
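For reference, distance-based thinning can be sketched in a few lines of Python. This is only an illustration of the general idea (greedily keep a site if it is at least a minimum distance from the last kept site on the same chromosome), not a reimplementation of vcftools; the function name `thin_sites` and the handling of sites exactly at the threshold are assumptions.

```python
def thin_sites(sites, min_dist):
    """Greedy thinning: walk through the sites in order and keep a site
    only if it is at least `min_dist` bp from the last kept site on the
    same chromosome.

    `sites` is a list of (chrom, pos) tuples, assumed to be sorted by
    position within each chromosome.
    """
    kept = []
    last_kept = {}  # chrom -> position of the most recently kept site
    for chrom, pos in sites:
        prev = last_kept.get(chrom)
        if prev is None or pos - prev >= min_dist:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


sites = [("chr1", 100), ("chr1", 110), ("chr1", 125), ("chr1", 300)]
print(thin_sites(sites, 20))
```

This shows why thinning feels crude: which site survives depends only on order and distance, not on variant quality or type.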
Did you try GATK IndelRealigner? It realigns reads around indels to minimize false-positive mismatches that a variant caller could otherwise call as SNPs. Check this link: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php. A sloppy alternative would be to remove SNPs within 10 or 20 bp of indels from the VCF file. I would prefer realigning around the indels first and then calling variants.
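The "sloppy alternative" could look roughly like the Python sketch below. It is a simplification under stated assumptions: variants are given as (chrom, pos, ref, alt) tuples with a single ALT allele, an indel is any record whose REF and ALT lengths differ, and only the indel's start position is used (its length is ignored). The function name `drop_snps_near_indels` is hypothetical.

```python
import bisect
from collections import defaultdict


def drop_snps_near_indels(variants, gap=10):
    """Remove SNPs whose position is within `gap` bp of an indel start
    on the same chromosome. Indels themselves are always kept.

    `variants` is a list of (chrom, pos, ref, alt) tuples (assumed layout).
    """
    # Collect sorted indel positions per chromosome.
    indel_pos = defaultdict(list)
    for chrom, pos, ref, alt in variants:
        if len(ref) != len(alt):
            indel_pos[chrom].append(pos)
    for positions in indel_pos.values():
        positions.sort()

    kept = []
    for variant in variants:
        chrom, pos, ref, alt = variant
        if len(ref) != len(alt):
            kept.append(variant)  # indel: keep
            continue
        positions = indel_pos.get(chrom, [])
        # First indel at or after (pos - gap); it is "near" if it is
        # also at or before (pos + gap).
        i = bisect.bisect_left(positions, pos - gap)
        if i < len(positions) and positions[i] <= pos + gap:
            continue  # SNP too close to an indel: drop
        kept.append(variant)
    return kept


calls = [
    ("chr1", 100, "A", "G"),   # SNP 5 bp from the indel below
    ("chr1", 105, "AT", "A"),  # deletion
    ("chr1", 200, "C", "T"),   # isolated SNP
]
print(drop_snps_near_indels(calls, gap=10))
```

In the example the SNP at position 100 is dropped because it sits 5 bp from the deletion at 105, while the deletion and the isolated SNP at 200 are kept.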
Thank you for your comments. Yes, I realigned with GATK IndelRealigner before calling. Your suggestion inspired me: I found that bcftools has two parameters, --SnpGap and --IndelGap. But what values are suggested for them?
I answered a similar post before. These thresholds are subjective. You can find them here: "what is the properties of filtering the vcf files"
150 populations, but how many samples? With a few thousand samples from diverse populations, an average distance of ~50 bp between variants is expected.