Tutorial: GATK method for filtering vcf lines using GT values at all or multisample level.
2
gravatar for kirannbishwa01
22 months ago by
United States
kirannbishwa01950 wrote:

Below is a short tutorial for filtering vcf lines using GT values at all or multisample level. Filtration of vcf using genotype values at single sample level is mostly explain at GATK, but filtration of vcf sites at population level (multisample genotype annotation) is developing. SnpSift and GATK have methods to do it, but less examples are explained.

I just had to do some filtering of my own data and tested several methods to see what the outcome would be. I am sharing those here for others and myself, if it is ever needed again.

ADD COMMENTlink modified 21 months ago • written 22 months ago by kirannbishwa01950

see also: http://lindenb.github.io/jvarkit/VCFFilterJS.html

ADD REPLYlink written 21 months ago by Pierre Lindenbaum118k
3
gravatar for kirannbishwa01
21 months ago by
United States
kirannbishwa01950 wrote:

I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.

Tutorials .... !!!

- method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
# outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed

- in the above script if we want the site where at least 4 samples are homozygous ref or variants

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
# outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)

- at least 1 sample is homVar (1/1, 2/2 or 3/3)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'

- all the samples are homRef (0/0 genotypes)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'

Note: 6 chr = 3 samples (when samples are diploid)

- sites that are all HomVar or nocalls (./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
# outcome: no 0/0, 0/1
# Note: samples with GT = ./. are not counted as called chr

Useful after samples are sub-sampled from multisample vcfs

- sites where no genotypes were called for all the samples in the selected vcf - or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
# outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected

- samples are either homRef (0/0) or no call (./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
# outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0

- site where 6 chromosomes (exact) or 3 samples were called for diploid genome

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
# Note of Caution: 6 chr = 3 samples (when samples are diploid)

- site where at least 3 samples (diploid) are called

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'

- if I have 6-sample vcf

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
# 12 haploid chr = 6 samples (diploid)
# outcome: sites that were called in all samples, there will be no GT = ./.

- select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'

- select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'

- site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'

Other updates to follow. Thanks,

ADD COMMENTlink written 21 months ago by kirannbishwa01950
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1070 users visited in the last hour