Building vcf from most common alleles
11 weeks ago
Eugene • 0

I have a vcf file that contains variant calls at many sites across multiple [haploid] samples (FreeBayes or GATK output). The sequenced samples are derivatives of a single strain, and have alternate alleles in common at many sites. I'd like to update the reference genome sequence I have using variants that are shared between a large proportion of samples.

I was thinking of going about this by creating a single vcf file containing the alternate allele at all sites in which the alternate allele is shared among e.g. 80% of my samples, and then using GATK's FastaAlternateReferenceMaker to update the reference sequence.

I was thinking I could start with something like:

vcftools --vcf all_samples.vcf --non-ref-af 0.8 --max-non-ref-af 1 --recode --recode-INFO-all --out common_vars

to select positions where at least 80% of samples carry an alternate allele, but the resulting vcf file still contains the genotype columns for every sample at those positions. What I think I need instead is a single 'consensus' vcf file with only the most common alternate allele at each of those sites.

Is there a simple way to create such a vcf? Or more generally, a different way to create a consensus fasta sequence using call data from multiple samples?
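
To make the goal concrete, here is a minimal Python sketch of the kind of consensus step I have in mind. It is only an illustration under several assumptions: the function name `consensus_lines` and the `min_fraction` parameter are made up, the input is an uncompressed multi-sample VCF with haploid genotypes in the GT field, and the 80% threshold is taken over all samples (so missing calls count against a site). It emits a sites-only VCF (first 8 columns) with the single most common alternate allele per site.

```python
from collections import Counter


def consensus_lines(vcf_lines, min_fraction=0.8):
    """Yield sites-only VCF lines with the most common ALT allele per site.

    Assumptions (not from any tool's documentation): haploid GT values,
    and the fraction is computed over all samples, called or not.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            yield "\t".join(fields[:8])  # drop FORMAT and sample columns
            continue
        if len(fields) < 10:
            continue  # no genotype columns to count
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        alts = fields[4].split(",")
        samples = fields[9:]

        # Count how many samples carry each ALT allele index (1, 2, ...).
        counts = Counter()
        for sample in samples:
            gt = sample.split(":")[gt_index]
            # Haploid calls look like "1"; if a caller wrote "1/1",
            # only the first allele is used here.
            allele = gt.replace("|", "/").split("/")[0]
            if allele.isdigit() and int(allele) > 0:
                counts[int(allele)] += 1
        if not counts:
            continue

        allele, n = counts.most_common(1)[0]
        if n / len(samples) >= min_fraction:
            # INFO is reset to "." because per-cohort tags such as AC/AF
            # no longer describe the single consensus allele.
            yield "\t".join(fields[:4] + [alts[allele - 1], fields[5], fields[6], "."])
```

Something like `open("consensus.vcf", "w").write("\n".join(consensus_lines(open("all_samples.vcf"))) + "\n")` would write the result, which could then in principle be given to FastaAlternateReferenceMaker. The sketch does nothing special for indels or overlapping variants, so those would still need checking.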

vcf consensus variant • 154 views
