Remove invariant sites from a VCF file
6.8 years ago
aberry814 ▴ 80

I have a VCF file with ~30K sites across 131 samples. I am trying to make it include only variant sites, meaning I want to exclude loci where all of my 131 samples have the same genotype, regardless of what the reference allele is. I used GATK SelectVariants with the -env flag, but that only excludes sites where all samples are 0/0 (homozygous reference), not sites where all samples are 1/1 (homozygous alternate).

I am a pretty terrible coder and struggle to modify VCF files.

My question is: Does anybody have a script or know of a tool that can remove the entire site (line) if all 131 samples (columns?) have 1/1 in the genotype position? Or more generally, if all samples have the same genotype at that site, whether it be 0/0, 0/1, or 1/1 (GATK can do the 0/0 and 0/1, but if it's easier to kill 3 birds with one stone then no problem).

Thanks!

Alex

VCF SNP • 4.7k views
6.8 years ago

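Using VCFFilterJS: http://lindenb.github.io/jvarkit/VCFFilterJS.html

The script below keeps a site only if at least one sample's genotype differs from that of the first sample, so sites where every sample has the same genotype are dropped.

 java -jar dist/vcffilterjs.jar -e '
   // keep the site only if some sample differs from the first sample
   function accept(v) {
     var g0 = v.getGenotype(0);
     for (var i = 1; i < v.getNSamples(); i++) {
       if (!v.getGenotype(i).sameGenotype(g0)) return true;
     }
     return false;
   }
   accept(variant);' input.vcf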

Thanks so much! It seems to have worked perfectly.


Hi Pierre! I am using your solution to get rid of the same sites as aberry814, but this does not seem to eliminate the positions for which all genotyped individuals are 1/1, 0/0 or 0/1 AND some individuals have missing data. I guess a simple modification could do it?

Thanks a lot in advance!

Begona
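
For the missing-data case, one possible approach is to skip no-call genotypes before comparing the rest. The following is only a rough standalone Python sketch, not a modification of the jvarkit script above. It assumes a tab-delimited VCF with a FORMAT column containing GT, treats any genotype with a "." allele as missing, and ignores phasing and allele order (so 0|1 and 1/0 count as the same genotype). The function names `normalize_gt`, `is_variant_site`, and `filter_vcf` are made up for illustration.

```python
def normalize_gt(sample_field, gt_index):
    """Return the genotype as a sorted tuple of alleles, or None if missing."""
    parts = sample_field.split(":")
    if gt_index >= len(parts):
        return None
    alleles = parts[gt_index].replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None  # assumption: partially missing calls count as missing
    return tuple(sorted(alleles))


def is_variant_site(line):
    """True if at least two called samples have different genotypes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return True  # no sample columns: nothing to compare, keep the line
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return True  # no genotypes to compare, keep the line
    gt_index = fmt.index("GT")
    called = {normalize_gt(s, gt_index) for s in fields[9:]}
    called.discard(None)  # drop missing genotypes before comparing
    return len(called) > 1


def filter_vcf(lines):
    """Yield header lines and sites that vary among the called samples."""
    for line in lines:
        if line.startswith("#") or is_variant_site(line):
            yield line

# Example use as a stdin-to-stdout filter:
#   import sys; sys.stdout.writelines(filter_vcf(sys.stdin))
```

Note that with this logic a site where every sample is missing is also removed, since no two called genotypes differ.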
