Tutorial: GATK method for filtering vcf lines using GT values at all or multisample level.
2
gravatar for kirannbishwa01
2.1 years ago by
kirannbishwa011.0k
United States
kirannbishwa011.0k wrote:

Below is a short tutorial for filtering vcf lines using GT values at all or multisample level. Filtration of vcf using genotype values at single sample level is mostly explain at GATK, but filtration of vcf sites at population level (multisample genotype annotation) is developing. SnpSift and GATK have methods to do it, but less examples are explained.

I just had to do some filtering of my own data and tested several methods to see what the outcome would be. I am sharing those here for others and myself, if it is ever needed again.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by kirannbishwa011.0k

see also: http://lindenb.github.io/jvarkit/VCFFilterJS.html

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum121k
3
gravatar for kirannbishwa01
2.1 years ago by
kirannbishwa011.0k
United States
kirannbishwa011.0k wrote:

I just had to do some filtering of my own data and tested several methods (to see what the outcome would be). I am sharing those here for others and myself, if it is ever needed again.

Tutorials .... !!!

- method to call the sites that are all homozygous Variants (1/1, 2/2 genotype)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6'
# outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed

- in the above script if we want the site where at least 4 samples are homozygous ref or variants

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4'
# outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.)

- at least 1 sample is homVar (1/1, 2/2 or 3/3)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0'

- all the samples are homRef (0/0 genotypes)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6'

Note: 6 chr = 3 samples (when samples are diploid)

- sites that are all HomVar or nocalls (./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2'
# outcome: no 0/0, 0/1
# Note: samples with GT = ./. are not counted as called chr

Useful after samples are sub-sampled from multisample vcfs

- sites where no genotypes were called for all the samples in the selected vcf - or say, select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0'
# outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected

- samples are either homRef (0/0) or no call (./.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2'
# outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0

- site where 6 chromosomes (exact) or 3 samples were called for diploid genome

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6'
# Note of Caution: 6 chr = 3 samples (when samples are diploid)

- site where at least 3 samples (diploid) are called

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6'

- if I have 6-sample vcf

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12'
# 12 haploid chr = 6 samples (diploid)
# outcome: sites that were called in all samples, there will be no GT = ./.

- select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0'

- select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1'

- site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.)

java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'

Other updates to follow. Thanks,

ADD COMMENTlink written 2.1 years ago by kirannbishwa011.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 543 users visited in the last hour