I have a VCF file with ~30K sites across 131 samples. I want it to include only variant sites, meaning I want to exclude loci where all 131 samples have the same genotype, regardless of what the reference allele is. I used GATK SelectVariants with the -env flag, but that only excludes sites where all samples are 0/0, not sites where all samples are 1/1 (homozygous alternate).
I am a pretty terrible coder and struggle to modify VCF files.
My question is: Does anybody have a script or know of a tool that can remove the entire site (line) if all 131 samples (columns?) have 1/1 in the genotype position? Or more generally, if all samples have the same genotype at that site, whether it be 0/0, 0/1, or 1/1 (GATK can do the 0/0 and 0/1, but if it's easier to kill 3 birds with one stone then no problem).
Thanks!
Alex
Thanks so much! It seems to have worked perfectly.
Hi Pierre! I am using your solution to get rid of the same kind of sites as aberry814, but it does not seem to remove positions where all genotyped individuals share the same genotype (1/1, 0/0 or 0/1) AND some individuals have missing data. I guess a simple modification could do it?
Thanks a lot in advance!
Begona
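In case it helps with the missing-data case, here is a minimal Python sketch. It is not Pierre's solution, just one possible approach. It drops a site when every called sample has the same genotype and ignores samples with missing calls when making that decision. The function names (is_variable_site, filter_vcf) are made up for illustration. It assumes an uncompressed, tab-delimited VCF with a GT entry in the FORMAT column, treats phased and unphased calls alike (0|1 = 1|0 = 0/1), and counts partially missing calls such as ./1 as missing.

```python
import sys


def is_variable_site(line):
    """Return True if the called samples on this VCF data line show more than one genotype."""
    fields = line.rstrip("\n").split("\t")
    # Column 9 (index 8) is FORMAT; assumption: it contains a GT key.
    gt_index = fields[8].split(":").index("GT")
    genotypes = set()
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing (or partially missing) call: ignore this sample
        genotypes.add(tuple(sorted(alleles)))
    # Zero or one distinct genotype among called samples means the site is invariant.
    return len(genotypes) > 1


def filter_vcf(lines):
    """Yield header lines unchanged, plus data lines where the genotypes actually differ."""
    for line in lines:
        if line.startswith("#") or is_variable_site(line):
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python filter_invariant.py input.vcf output.vcf
    with open(sys.argv[1]) as vcf_in, open(sys.argv[2], "w") as vcf_out:
        vcf_out.writelines(filter_vcf(vcf_in))
```

Note that a site where every sample is missing is also dropped, since no two called genotypes differ. If you would rather keep such sites, change the final comparison in is_variable_site accordingly.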