I have generated a VCF file using the GATK UnifiedGenotyper, but my data set has quite a lot of missing data. Does anyone know a way to keep only the SNPs that are genotyped in at least, say, 80% of my samples, excluding any that fall below this threshold?
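To make the filter concrete, here is a minimal Python sketch of what I mean. It assumes an uncompressed, tab-delimited VCF, that GT is the first per-sample field, and that a genotype counts as missing if any allele is `.` (for example `./.`). The function names `call_rate` and `filter_by_call_rate` are hypothetical, not from GATK or any other tool.

```python
def call_rate(vcf_line):
    """Return the fraction of samples with a non-missing genotype on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    samples = fields[9:]  # columns 10 onward hold the per-sample data
    if not samples:
        return 0.0
    called = 0
    for sample in samples:
        gt = sample.split(":")[0]  # assumption: GT is the first per-sample field
        alleles = gt.replace("|", "/").split("/")
        # Assumption: a genotype with any "." allele is treated as missing.
        if all(allele != "." for allele in alleles):
            called += 1
    return called / len(samples)


def filter_by_call_rate(lines, min_rate=0.8):
    """Yield header lines unchanged, plus data lines whose call rate is >= min_rate."""
    for line in lines:
        if line.startswith("#") or call_rate(line) >= min_rate:
            yield line
```

Applied to the lines of the VCF, `filter_by_call_rate(lines, 0.8)` would keep the header and every site genotyped in at least 80% of samples.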
Also, I have multiple SNPs per contig, and I would like a set with only one SNP per contig to reduce the effects of linkage. Does anyone know a way of randomly selecting one SNP per contig or, alternatively, selecting the SNP with the best coverage or quality score per contig, so that each SNP in the final dataset comes from a separate contig?
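As a rough illustration of the per-contig selection, the sketch below groups VCF data lines by the CHROM column and keeps either the record with the highest QUAL or a random one. It assumes uncompressed, tab-delimited VCF lines and uses QUAL as the "best" criterion (coverage would need the DP value parsed from INFO instead). The names `site_quality` and `pick_one_snp_per_contig` are hypothetical.

```python
import random


def site_quality(vcf_line):
    """Return the QUAL column as a float; a missing QUAL (".") ranks lowest."""
    value = vcf_line.split("\t")[5]
    return float(value) if value != "." else float("-inf")


def pick_one_snp_per_contig(lines, mode="best", seed=None):
    """Return one data line per contig: highest QUAL ("best") or a random pick ("random")."""
    rng = random.Random(seed)  # a seed makes the random choice reproducible
    by_contig = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        contig = line.split("\t")[0]
        by_contig.setdefault(contig, []).append(line)
    chosen = []
    for records in by_contig.values():
        if mode == "random":
            chosen.append(rng.choice(records))
        else:
            chosen.append(max(records, key=site_quality))
    return chosen
```

Running this after any missing-data filter, rather than before, would avoid picking a SNP per contig that is later discarded.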