I have a VCF with many SNP calls that are identical across all my samples and differ only from the reference, but I only want to keep SNPs that are variable within my sample set (I don't care about the reference). How do I get rid of the others with VCF tools?
bcftools filter -e "INFO/AF=1.0" -O z -o output.vcf.gz input.vcf
You may want to make sure you don't throw out sites where only one or two samples are called as 1/1 and the rest are no-calls, in which case you could require a minimum number of called alleles. This is probably easiest to do with bcftools. For example, exclude a variant only if it has an AF of 1.0 and at least 10 called alleles (e.g. 5 diploid genotypes), by also testing INFO/AN in the exclusion expression.
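To make the filtering logic explicit, here is a minimal pure-Python sketch (not bcftools itself) that computes the same decision directly from the genotype fields rather than trusting INFO/AF, which can be stale or missing after subsetting samples. The names keep_site, filter_vcf and min_called are hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF in which GT is the first FORMAT key.

```python
import re


def keep_site(vcf_line, min_called=10):
    """Return False only when every called allele is the same non-reference
    allele AND at least `min_called` alleles were called; True otherwise."""
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = []
    for sample in fields[9:]:              # sample columns start at column 10
        gt = sample.split(":")[0]          # assumption: GT is the first FORMAT key
        for allele in re.split(r"[/|]", gt):
            if allele != ".":              # skip no-calls
                alleles.append(allele)
    called = len(alleles)
    # "Fixed" means one distinct allele among the called ones, and it is not REF.
    fixed_alt = called > 0 and len(set(alleles)) == 1 and alleles[0] != "0"
    return not (fixed_alt and called >= min_called)


def filter_vcf(lines, min_called=10):
    """Yield header lines unchanged plus every record that passes keep_site."""
    for line in lines:
        if line.startswith("#") or keep_site(line, min_called):
            yield line
```

With min_called=10, a site where five diploid samples are all 1/1 is dropped, while a site with a single 1/1 call and four no-calls is kept, because two called alleles are too few to say the site is fixed.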
What have you tried? Have you looked at the bcftools and vcftools manual pages?
Also, can you give us examples of sites you wish to keep and sites you wish to ignore, including the logic for picking them? The current description does not help us understand your requirement well.
I think I get what you want. For example, you have 6 yeast strains: 1 is a control and the other 5 represent conditions, and you're interested in how they differ from each other or from the control. If they all differ from the reference in the same way (including the control), the variant is not interesting, because it is present in the control and was therefore already there when the experiment started. You want to ignore positions in your VCF file where all samples are equally different from the reference. One way to do this (I'm not sure it's the best way) is to use GATK SelectVariants and create a compound select expression using vc.getGenotype("sampleID1").isHomVar() && vc.getGenotype("sampleID2").isHomVar() etc. to select positions that are homozygous variant in all your samples, and write these to a file. You can then run SelectVariants a second time on your original VCF file, passing that file to the --discordance argument. This will grab the locations that are not uniformly variant. It seems more complicated than it should be, but that's one way.
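With more than a few samples the compound expression gets tedious to type, so here is a small Python sketch that builds it and prints the two SelectVariants commands. The function name homvar_all_expression, the sample IDs, and the file names uniform.vcf and variable.vcf are placeholders, and the command lines assume GATK4-style options (-V, -select, -O, --discordance), so check them against your GATK version before running.

```python
def homvar_all_expression(sample_ids):
    """Join one isHomVar() test per sample with && for a JEXL select expression."""
    return " && ".join(
        f'vc.getGenotype("{sample}").isHomVar()' for sample in sample_ids
    )


samples = ["sampleID1", "sampleID2", "sampleID3"]  # placeholder sample names
expr = homvar_all_expression(samples)

# Step 1: write the sites that are homozygous variant in every sample.
print(f"gatk SelectVariants -V input.vcf -select '{expr}' -O uniform.vcf")
# Step 2: keep records from the original file that are absent from uniform.vcf.
print("gatk SelectVariants -V input.vcf --discordance uniform.vcf -O variable.vcf")
```

The script only prints the commands, so you can inspect the expression before running anything.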