Building a VCF from the most common alleles
2.7 years ago
Eugene • 0

I have a VCF file that contains variant calls at many sites across multiple haploid samples (FreeBayes or GATK output). The sequenced samples are derivatives of a single strain and share alternate alleles at many sites. I'd like to update my reference genome sequence using the variants that are shared among a large proportion of the samples.

My plan is to create a single VCF file containing the alternate allele at every site where that allele is shared by, e.g., at least 80% of my samples, and then use GATK's FastaAlternateReferenceMaker to update the reference sequence.

I could start with something like:

vcftools --vcf all_samples.vcf --non-ref-af 0.8 --recode --recode-INFO-all --out common_vars

to select positions where at least 80% of samples carry an alternate allele, but this produces a VCF file that still contains every sample's genotype at those positions. What I think I need instead is a 'consensus' VCF file with only the most common alternate allele at each of those sites.

Is there a simple way to create such a VCF? Or, more generally, is there a different way to create a consensus FASTA sequence using call data from multiple samples?
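
To make the idea concrete, here is a minimal Python sketch of the kind of 'consensus' VCF described above. It is an illustration under stated assumptions, not an established tool: the function name consensus_vcf_lines and the min_fraction parameter are hypothetical, genotypes are assumed to be haploid (a single allele index per sample), GT is assumed to be the first FORMAT field, missing calls count against the threshold, and the single most common alternate allele must itself reach the threshold. The output is sites-only (no sample columns), with INFO reset to "." because the original per-site annotations would no longer match the reduced record.

```python
from collections import Counter


def consensus_vcf_lines(vcf_lines, min_fraction=0.8):
    """Yield sites-only VCF lines holding the most common ALT allele.

    Assumptions (not from any library): haploid GT values, GT as the
    first FORMAT field, and a threshold measured against all samples.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Pass meta-information lines through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Keep only the eight fixed columns: drop FORMAT and samples.
            yield "\t".join(fields[:8])
            continue
        if len(fields) < 10 or fields[8].split(":")[0] != "GT":
            continue  # no genotype data to count
        alts = fields[4].split(",")
        samples = fields[9:]
        counts = Counter()
        for sample in samples:
            allele = sample.split(":")[0]
            # Skip missing calls (".") and reference calls ("0").
            if allele.isdigit() and allele != "0":
                counts[int(allele)] += 1
        if not counts:
            continue
        allele_index, n_carriers = counts.most_common(1)[0]
        if n_carriers / len(samples) < min_fraction:
            continue
        # Emit CHROM, POS, ID, REF, chosen ALT, QUAL, FILTER, and empty INFO.
        yield "\t".join(fields[:4] + [alts[allele_index - 1]] + fields[5:7] + ["."])


if __name__ == "__main__":
    import sys

    # Usage: python consensus_vcf.py < all_samples.vcf > consensus.vcf
    for out_line in consensus_vcf_lines(sys.stdin):
        print(out_line)
```

The resulting sites-only file is the sort of input I would then hand to FastaAlternateReferenceMaker. Whether missing calls should count against the 80% threshold is a design choice; dividing by the number of called samples instead would be a small change.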

vcf consensus variant • 551 views