Building vcf from most common alleles

2.7 years ago

Eugene • 0

I have a VCF file (FreeBayes or GATK output) containing variant calls at many sites across multiple haploid samples. The sequenced samples are all derived from a single strain and share alternate alleles at many sites. I'd like to update my reference genome sequence using the variants that are shared by a large proportion of the samples.

My plan is to create a single VCF file containing the alternate allele at every site where that allele is shared by, e.g., at least 80% of my samples, and then use GATK's FastaAlternateReferenceMaker to update the reference sequence.

I was thinking I could start with something like:

vcftools --vcf all_samples.vcf --non-ref-af 0.8 --max-non-ref-af 1 --recode --recode-INFO-all --out common_vars

to select positions where the alternate allele frequency is at least 80%. However, the output is still a multi-sample VCF that keeps every sample's genotype at those positions, whereas what I think I need is a 'consensus' VCF with one record per site carrying only the most common alternate allele.
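To make the target concrete, here is a minimal Python sketch of the kind of consensus file I mean. It is only an illustration under several assumptions: the function name `consensus_records` and the `min_frac` parameter are made up for this post, genotypes are haploid (for a GT such as `1/1`, only the first allele is read), the 80% is counted over all samples including those with missing calls, and the output is a sites-only VCF with the INFO field blanked. Whether a downstream tool accepts such a file would still need checking.

```python
import re
from collections import Counter


def consensus_records(vcf_lines, min_frac=0.8):
    """Yield sites-only VCF lines holding the most common ALT allele per site.

    Assumptions (not taken from any tool's documented behaviour):
    haploid GT fields, and min_frac measured against ALL samples,
    so missing calls count against the allele.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            yield "\t".join(fields[:8])  # drop FORMAT and sample columns
            continue
        if len(fields) < 10:
            continue  # no genotype columns to count
        chrom, pos, vid, ref, alt, qual, filt = fields[:7]
        alts = alt.split(",")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        samples = fields[9:]

        # Count how many samples carry each ALT allele index (1, 2, ...).
        counts = Counter()
        for sample in samples:
            parts = sample.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            allele = re.split(r"[/|]", gt)[0]  # haploid: first allele only
            if allele.isdigit() and 0 < int(allele) <= len(alts):
                counts[int(allele)] += 1
        if not counts:
            continue

        best, n = counts.most_common(1)[0]
        if n / len(samples) >= min_frac:
            yield "\t".join([chrom, pos, vid, ref, alts[best - 1], qual, filt, "."])
```

It could be fed an open file handle for the multi-sample VCF, with the yielded lines written to a new file, one per line.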

Is there a simple way to create such a VCF? Or, more generally, is there a different way to build a consensus FASTA sequence from the call data of multiple samples?

vcf consensus variant • 551 views
