Question: Gatk Multi-Sample Vcf Variantfiltration
gravatar for William
7.2 years ago by
William4.7k wrote:

How does GATK VariantFiltration work on multi-sample vcf files?

VariantFiltration is used to annotate likely false positive SNP's based on certain formula's:

--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)"  --filterName "HARD_TO_VALIDATE" 

--filterExpression "DP < 5 " --filterName "LowCoverage" 

--filterExpression "QD < 1.5 " --filterName "LowQD" 

--filterExpression "SB > -10.0 " --filterName "StrandBias"

--filterExpression "QUAL > 30.0 && QUAL < 50.0 " --filterName "LowQual" 

--clusterWindowSize 10

It is easy enough to understand how this works on single sample VCF files, but how does this work on multi-sample vcf files?

For example the low coverage filter, will it annotate the SNP low coverage if

a) all of the samples combined in total have less than 5 reads for the SNP

b) each of the samples has less than 5 reads

c) one or more of the samples has less than 5 reads.

The same for the MQ0, QD and SB annotation. Are they set when:

a) all of the samples combined reach the threshold

b) each sample it self reaches the threshold

c) or one or more of the samples reach the threshold

The lowqual and snpcluster annotation are set I guess based on all samples combined.

vcf gatk • 8.0k views
ADD COMMENTlink modified 5.6 years ago by Biostar ♦♦ 20 • written 7.2 years ago by William4.7k
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

not a direct answer to your question: I wrote a java tools filtering the VCF with javascript. See

With java script, you'll be able to declare some functions and run some loops. It uses the javascript-rhino technology:

for example, say you want to keep the variations from having at leat two samples with DP < 200. The script would be:

function myfilterFunction()
    var samples=header.genotypeSamples;
    var countOkDp=0;

    for(var i=0; i< samples.size();++i)
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp < 200 ) countOkDp++;
    return (countOkDp>2)


$ curl -s "" | java -jar  dist/vcffilterjs.jar  -f filter.js | grep -v "##"

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    BLANK    NA12878    NA12891    NA12892    NA19238    NA19239    NA19240
chr22    42526449    .    T    A    151.47    .    AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611    GT:AD:DP:GQ:PL    0/1:23,8:31:99:190,0,694    0/0:188,0:190:99:0,478,5376    0/0:187,0:187:99:0,493,5322    0/0:247,0:249:99:0,634,6728    0/0:185,0:185:99:0,487,5515    0/0:202,0:202:99:0,520,5857    0/0:181,1:182:99:0,440,5362
chr22    42526634    .    T    C    32.60    .    AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276    GT:AD:DP:GQ:PL    0/1:21,4:25:71:71,0,702    0/0:187,2:189:99:0,481,6080    0/0:233,0:233:99:0,667,7351    0/0:230,0:230:99:0,667,7394    0/0:174,1:175:99:0,446,5469    0/0:194,2:196:99:0,498,6239    0/0:174,0:175:99:0,511,5894
chr22    42527793    rs1080989    C    T    3454.66    .    AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618    GT:AD:DP:GQ:PL    ./.    0/1:72,90:162:99:1699,0,1767    0/1:103,96:202:99:1756,0,2532    0/0:188,0:188:99:0,526,5889    0/0:160,0:160:99:0,457,4983    0/0:197,0:198:99:0,544,6100    0/0:156,0:156:99:0,439,5041
ADD COMMENTlink modified 6.4 years ago • written 7.2 years ago by Pierre Lindenbaum129k

Thanks for providing this tool. I modified script file filter.js to filter based on GQ but it's giving error. javax.script.ScriptException: ReferenceError: "GQ" is not defined. (<unknown source="">#16) in <unknown source=""> at line number 16

ADD REPLYlink written 7.0 years ago by pankajxyz0
gravatar for hannes.svardal
6.5 years ago by
hannes.svardal180 wrote:

As I understand the correct answer is a) all of the samples combined in total have less than 5 reads for the SNP

It uses the DP flag in the info column of the VCF file, and this is the combined depth over all samples. Note, however, that it can be smaller than the total number of raw reads that map there because the GATK genotypers apply some read filtering before SNP calling.

Also the other statistics are calculated over all samples, as I understand.

ADD COMMENTlink written 6.5 years ago by hannes.svardal180
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 907 users visited in the last hour