I have a VCF file containing variant calls at many sites across multiple haploid samples (FreeBayes or GATK output). The sequenced samples are all derived from a single strain and share alternate alleles at many sites. I'd like to update my reference genome sequence using the variants that are shared by a large proportion of the samples.
My plan is to create a single VCF file containing the alternate allele at every site where that allele is shared by, say, at least 80% of my samples, and then use GATK's FastaAlternateReferenceMaker to update the reference sequence.
I could start with something like:
vcftools --vcf all_samples.vcf --non-ref-af 0.8 --max-non-ref-af 1 --recode --recode-INFO-all --out common_vars
This selects positions where at least 80% of samples carry an alternate allele, but the resulting VCF still contains every sample's genotype at those positions. What I think I need instead is a 'consensus' VCF that lists only the most common alternate allele at each of those sites.
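To make the goal concrete, here is a rough Python sketch of the kind of 'consensus' VCF I have in mind. It is only an illustration under several assumptions of mine: genotypes are haploid (a single allele index such as 0, 1 or 2 in GT), the 80% is measured against all samples (so missing calls count against a site), and the INFO column is blanked because the original per-allele annotations would no longer match a single ALT. The names consensus_sites and min_fraction are made up for this example.

```python
from collections import Counter


def consensus_sites(vcf_lines, min_fraction=0.8):
    """Yield sites-only VCF lines with the most common ALT allele per site.

    Assumes haploid GT values (one allele index per sample).
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            yield "\t".join(fields[:8])  # drop FORMAT and sample columns
            continue
        if len(fields) < 10:
            continue  # no genotype columns to count
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        alts = fields[4].split(",")
        samples = fields[9:]

        # Count how many samples carry each ALT allele (index 1, 2, ...).
        counts = Counter()
        for sample in samples:
            gt = sample.split(":")[gt_index]
            if gt.isdigit() and int(gt) > 0:
                counts[int(gt)] += 1
        if not counts:
            continue

        allele, n = counts.most_common(1)[0]
        # Assumption: the denominator is all samples, called or not.
        if n / len(samples) >= min_fraction:
            fields[4] = alts[allele - 1]  # keep only the winning ALT allele
            fields[7] = "."  # old INFO no longer describes a single ALT
            yield "\t".join(fields[:8])


# Hypothetical usage on the vcftools output from above:
# with open("common_vars.recode.vcf") as vcf:
#     for record in consensus_sites(vcf):
#         print(record)
```

The output has no FORMAT or sample columns, only one ALT per retained site, which is what I imagine passing on to the reference-updating step.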
Is there a simple way to create such a VCF? Or, more generally, is there a different way to build a consensus FASTA sequence from call data across multiple samples?