I'm a Master's student doing bioinformatics for the first time. I have generated a .vcf file of SNPs from whole genome sequencing data using the GATK pipeline. However, I'm working with a non-model organism (Aulorhynchus flavidus), so I don't have the set of known, validated SNPs that GATK's VQSR requires. My alternative is to hard filter the variants to remove lower-quality SNPs, but I'm not sure how stringently to set my filters.
- What is your approach to choosing the thresholds for hard filtering?
- How many SNPs should I expect to retain if I want to minimize noise (i.e., errors) while keeping as much signal as possible?
- Can you point out any good resources on filtering variants?
I know this is an open-ended question, and any answer will depend on the circumstances. However, without experience, I'm at a loss as to where to start. Any advice or references would help a lot.