I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.
Tutorials .... !!!
- method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
# outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed
- in the above script if we want the site where at least 4 samples are homozygous ref or variants
java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
# outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)
- at least 1 sample is homVar (1/1, 2/2 or 3/3)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'
- all the samples are homRef (0/0 genotypes)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'
Note: 6 chr = 3 samples (when samples are diploid)
- sites that are all HomVar or nocalls (./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
# outcome: no 0/0, 0/1
# Note: samples with GT = ./. are not counted as called chr
Useful after samples are sub-sampled from multisample vcfs
- sites where no genotypes were called for all the samples in the selected vcf
- or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
# outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected
- samples are either homRef (0/0) or no call (./.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
# outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0
- site where 6 chromosomes (exact) or 3 samples were called for diploid genome
java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
# Note of Caution: 6 chr = 3 samples (when samples are diploid)
- site where at least 3 samples (diploid) are called
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'
- if I have 6-sample vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
# 12 haploid chr = 6 samples (diploid)
# outcome: sites that were called in all samples, there will be no GT = ./.
- select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'
- select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'
- site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'
- where a site is only homRef for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf
- where a site is only homVar for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getGenotype("ms01e").isHomVar()' -O test_variants.ms01e.Chr_2_3.vcf
- where a site is not homRef for a particular sample and is only SNP. This will also select empty sites
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select '! vc.getGenotype("ms01e").isHomRef()' -O test_variants.ms01e.Chr_2_3.vcf
NOTE: following genotype call methods won't work:
# to get only heterozygous sites
-select 'vc.getGenotype("ms01e").isHetVar()
# to get any variant sites
-select 'vc.getGenotype("ms01e").isVar()
# won't work if getGenotype method is combined
-select 'vc.getGenotype("ms01e").isHomVar() || vc.getGenotype("ms01e".isHomRef()'
So, I figured out the following methods where count helps with selecting particular site
- where a site is a heterozygous variant site for a particular sample
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select 'vc.getHetCount() == 1' -O test_variants.ms01e.vcf
NOTE: 'vc.getGenotype("ms01e").isHetVar()' wont work so generated the above expression.
- where a site is a variant site (either homVar or hetVar) for a particular sample and is only SNP
java -jar GenomeAnalysisTK.jar SelectVariants -R lyrata_genome.fa -V phased.MySpF1.vcf -sn ms01e -select-type SNP -select 'vc.getHetCount() == 1 || vc.getHomVarCount() == 1' -O test_variants.ms01e.vcf
NOTE: 'vc.getGenotype("NA12878").isHomRef()' || vc.getGenotype("NA12878").isHomVar()' wont work together, so generated this above expression.
Other updates to follow.
Thanks,
see also: http://lindenb.github.io/jvarkit/VCFFilterJS.html