Question: Gatk Multi-Sample Vcf Variantfiltration
4
gravatar for William
6.3 years ago by
William4.5k
Europe
William4.5k wrote:

How does GATK VariantFiltration work on multi-sample vcf files?

VariantFiltration is used to annotate likely false positive SNP's based on certain formula's:

--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)"  --filterName "HARD_TO_VALIDATE" 

--filterExpression "DP < 5 " --filterName "LowCoverage" 

--filterExpression "QD < 1.5 " --filterName "LowQD" 

--filterExpression "SB > -10.0 " --filterName "StrandBias"

--filterExpression "QUAL > 30.0 && QUAL < 50.0 " --filterName "LowQual" 

--clusterWindowSize 10

It is easy enough to understand how this works on single sample VCF files, but how does this work on multi-sample vcf files?

For example the low coverage filter, will it annotate the SNP low coverage if

a) all of the samples combined in total have less than 5 reads for the SNP

b) each of the samples has less than 5 reads

c) one or more of the samples has less than 5 reads.

The same for the MQ0, QD and SB annotation. Are they set when:

a) all of the samples combined reach the threshold

b) each sample it self reaches the threshold

c) or one or more of the samples reach the threshold

The lowqual and snpcluster annotation are set I guess based on all samples combined.

vcf gatk • 7.5k views
ADD COMMENTlink modified 4.7 years ago by Biostar ♦♦ 20 • written 6.3 years ago by William4.5k
2
gravatar for Pierre Lindenbaum
6.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

not a direct answer to your question: I wrote a java tools filtering the VCF with javascript. See

https://github.com/lindenb/jvarkit/wiki/VCFFilterJS

With java script, you'll be able to declare some functions and run some loops. It uses the javascript-rhino technology: http://en.wikipedia.org/wiki/Rhino_%28JavaScript_engine%29

for example, say you want to keep the variations from https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf having at leat two samples with DP < 200. The script would be:

function myfilterFunction()
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;


    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp < 200 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction();

Example:

$ curl -s "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" | java -jar  dist/vcffilterjs.jar  -f filter.js | grep -v "##"

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    BLANK    NA12878    NA12891    NA12892    NA19238    NA19239    NA19240
chr22    42526449    .    T    A    151.47    .    AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611    GT:AD:DP:GQ:PL    0/1:23,8:31:99:190,0,694    0/0:188,0:190:99:0,478,5376    0/0:187,0:187:99:0,493,5322    0/0:247,0:249:99:0,634,6728    0/0:185,0:185:99:0,487,5515    0/0:202,0:202:99:0,520,5857    0/0:181,1:182:99:0,440,5362
chr22    42526634    .    T    C    32.60    .    AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276    GT:AD:DP:GQ:PL    0/1:21,4:25:71:71,0,702    0/0:187,2:189:99:0,481,6080    0/0:233,0:233:99:0,667,7351    0/0:230,0:230:99:0,667,7394    0/0:174,1:175:99:0,446,5469    0/0:194,2:196:99:0,498,6239    0/0:174,0:175:99:0,511,5894
chr22    42527793    rs1080989    C    T    3454.66    .    AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618    GT:AD:DP:GQ:PL    ./.    0/1:72,90:162:99:1699,0,1767    0/1:103,96:202:99:1756,0,2532    0/0:188,0:188:99:0,526,5889    0/0:160,0:160:99:0,457,4983    0/0:197,0:198:99:0,544,6100    0/0:156,0:156:99:0,439,5041
ADD COMMENTlink modified 5.5 years ago • written 6.3 years ago by Pierre Lindenbaum122k

Thanks for providing this tool. I modified script file filter.js to filter based on GQ but it's giving error. javax.script.ScriptException: sun.org.mozilla.javascript.internal.EcmaError: ReferenceError: "GQ" is not defined. (<unknown source="">#16) in <unknown source=""> at line number 16

ADD REPLYlink written 6.1 years ago by pankajxyz0
0
gravatar for hannes.svardal
5.6 years ago by
Austria
hannes.svardal140 wrote:

As I understand the correct answer is a) all of the samples combined in total have less than 5 reads for the SNP

It uses the DP flag in the info column of the VCF file, and this is the combined depth over all samples. Note, however, that it can be smaller than the total number of raw reads that map there because the GATK genotypers apply some read filtering before SNP calling.

Also the other statistics are calculated over all samples, as I understand.

ADD COMMENTlink written 5.6 years ago by hannes.svardal140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1651 users visited in the last hour