I would like to know how to filter a VCF by number of individuals with genotype data (i.e. without missing data).
First, some background:
I have a large VCF file containing SNP data for 95 individuals called from bcftools mpileup output. BAMs passed to mpileup contained data for alignments of RNAseq reads to a de novo transcriptome. Thus, the VCF lists SNPs within each contig defined in the reference de novo transcriptome.
The 95 individuals were collected from 6 different populations. For population genetic analysis, I would like to produce a VCF that only contains SNPs at which there are data for at least 75% of individuals in each population (11 or 12).
I have a decent familiarity with bcftools, and know how to filter missing data, but I am not sure how to approach this. As expected, filtering on the N_SAMPLES variable alone doesn't seem to modify the VCF. Perhaps my best course of action would be to first split the VCF into population-specific files?
Any help with this would be greatly appreciated! Please let me know if I've failed to include any useful information.