Applying hard filters for variants
5.9 years ago

I am currently working on influenza virus and Ebola virus. I have 45 virus samples, and therefore 45 BAM files aligned to the influenza reference genome (influenza.fa).

java -Xmx16g -Djava.io.tmpdir=$out_folder/tmp -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -nt 12 \
    -dcov 10000 \
    -glm BOTH \
    -R influenza.fa \
    -l INFO \
    -o A_California_Influenza_Virus.raw.vcf \
    --sample_ploidy 1 \
    $INPUT_BAM_FILES

This produced a single raw VCF file (A_California_Influenza_Virus.raw.vcf) covering all 45 samples. It contains 1400 VCF records.

Following the GATK Best Practices pipeline paper, I chose the hard-filtering option recommended for small datasets.

_Is my set of VCF records small enough that I should use hard filtering?_

Then I selected only the SNPs into a separate VCF file.

java -jar /data1/software/gatk/current/GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R A_California_Influenza_Virus_H1N1.fa \
    -V A_California_Influenza_Virus.raw.vcf \
    -selectType SNP \
    -o VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf

Then I applied hard filtering to the SNPs.

java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R A_California_Influenza_Virus_H1N1.fa \
    -V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "myfilter1" \
    -o VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf

I understand that variants matching any of the above conditions are considered bad variants.
What does QD < 2.0 mean?
What does FS > 60.0 mean?
What does MQ < 40.0 mean?
What does MQRankSum < -12.5 mean?
What are the threshold values for high-confidence variants for QD, FS, MQ, MQRankSum, ReadPosRankSum and DP?

3 months ago
smeeta • 0
1. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked in the FILTER column of the output VCF file with the filter name you specify in the filtering command. With a single combined expression, the output does not record which annotation caused the SNP to fail (the culprit annotation is written by VQSR, not by VariantFiltration); to see that, give each condition its own --filterExpression and --filterName pair. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
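The pass/fail logic described above can be sketched in a few lines of Python. This is only an illustration, not how GATK is implemented: the names SNP_HARD_FILTERS, failed_annotations and filter_status are hypothetical, the thresholds are taken from the filter expression in the question, and skipping annotations that are absent from a record is an assumption of this sketch.

```python
import operator

# (annotation, comparison, threshold) triples taken from the expression in the question.
SNP_HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_annotations(info):
    """Return the names of the annotations in `info` that trip a hard filter.

    `info` is a dict of INFO annotations for one record. Annotations that are
    missing are skipped (an assumption of this sketch).
    """
    failed = []
    for name, compare, threshold in SNP_HARD_FILTERS:
        value = info.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


def filter_status(info, filter_name="myfilter1"):
    """Return what the FILTER column would hold: PASS or the filter name."""
    return filter_name if failed_annotations(info) else "PASS"


if __name__ == "__main__":
    good = {"QD": 25.0, "FS": 1.2, "MQ": 59.0}
    bad = {"QD": 1.5, "FS": 75.0, "MQ": 59.0}
    print(filter_status(good), failed_annotations(good))
    print(filter_status(bad), failed_annotations(bad))
```

Listing the failed annotations separately mirrors what you get from one filter name per condition: the record itself then shows which threshold was crossed.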

QualByDepth (QD) < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
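As a rough illustration of that definition, QD can be computed by hand from QUAL and the per-sample depths. The function name qual_by_depth and the input layout (one depth and one haploid genotype string per sample) are assumptions of this sketch, and GATK's own calculation may differ in details.

```python
def qual_by_depth(qual, sample_depths, genotypes, ref_genotype="0"):
    """Divide QUAL by the summed depth of samples whose genotype is not reference.

    `sample_depths` and `genotypes` are parallel lists, one entry per sample.
    Returns None when no non-reference sample has any depth.
    """
    depth = sum(
        d for d, gt in zip(sample_depths, genotypes) if gt != ref_genotype
    )
    if depth == 0:
        return None
    return qual / depth


if __name__ == "__main__":
    # Three haploid samples; the second one carries the reference allele.
    print(qual_by_depth(900.0, [30, 40, 50], ["1", "0", "1"]))
```

Normalising by depth keeps deeply covered sites from looking convincing merely because many reads contribute to QUAL.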

FisherStrand (FS) > 60.0: This is the Phred-scaled p-value from Fisher's Exact Test for strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. Higher values indicate more bias, which is indicative of false positive calls.
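To make "Phred-scaled p-value" concrete, here is a minimal sketch that runs a two-sided Fisher's Exact Test on a 2x2 table of strand counts and converts the p-value with -10 * log10(p). The function names are hypothetical, and GATK applies extra steps (for example, handling of very large tables) that are not reproduced here.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


def fisher_strand(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scale the strand-bias p-value: larger means more bias."""
    p_value = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return max(0.0, -10 * log10(max(p_value, 1e-300)))


if __name__ == "__main__":
    print(fisher_strand(10, 10, 10, 10))  # balanced strands
    print(fisher_strand(10, 10, 20, 0))   # alternate allele only on forward reads
```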

RMSMappingQuality (MQ) < 40.0: This is the root mean square of the mapping qualities of the reads across all samples.
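The root mean square itself is simple to compute. The sketch below uses a hypothetical function name and made-up mapping qualities purely to show how a few poorly mapped reads pull the value down.

```python
from math import sqrt


def rms_mapping_quality(mapping_qualities):
    """Root mean square of a list of per-read mapping qualities."""
    if not mapping_qualities:
        return None
    mean_of_squares = sum(q * q for q in mapping_qualities) / len(mapping_qualities)
    return sqrt(mean_of_squares)


if __name__ == "__main__":
    print(rms_mapping_quality([60, 60, 60, 60]))  # all reads uniquely mapped
    print(rms_mapping_quality([60, 60, 20, 0]))   # some ambiguous reads
```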

MappingQualityRankSumTest (MQRankSum) < -12.5: This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with the reference allele vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. it will only be applied to heterozygous calls.

ReadPosRankSumTest (ReadPosRankSum) < -8.0: This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read, comparing reads with the alternate allele to reads with the reference allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. it will only be applied to heterozygous calls.
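Both rank sum annotations rest on the same statistic, so one sketch covers them. The code below computes the z-approximation of the Mann-Whitney U statistic for two lists of values (mapping qualities, or distances from the read end). The function names are hypothetical, no tie correction is applied to the variance, and GATK's implementation may differ; the sign convention used here is that a negative z means the alternate-allele reads have lower values than the reference-allele reads.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def rank_sum_z(alt_values, ref_values):
    """z-approximation of the Mann-Whitney U statistic for alt vs. ref reads.

    Returns None when either group is empty, which is why the annotation
    is missing at sites without both kinds of reads.
    """
    n_alt, n_ref = len(alt_values), len(ref_values)
    if n_alt == 0 or n_ref == 0:
        return None
    ranks = average_ranks(list(alt_values) + list(ref_values))
    alt_rank_sum = sum(ranks[:n_alt])
    u_alt = alt_rank_sum - n_alt * (n_alt + 1) / 2
    expected_u = n_alt * n_ref / 2
    std_u = sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u_alt - expected_u) / std_u


if __name__ == "__main__":
    # Alternate-allele reads map worse than reference-allele reads.
    print(rank_sum_z([20, 25, 20, 30, 22], [60, 60, 58, 60, 59]))
```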

StrandOddsRatio (SOR) > 3.0: The StrandOddsRatio annotation is one of several methods that aim to evaluate whether there is strand bias in the data. Higher values indicate more strand bias.
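For comparison with FS, here is a sketch of a symmetric odds ratio computed on the same 2x2 strand table. It follows my understanding of the formula described in the GATK documentation (a pseudocount of 1 added to every cell, then the log of the symmetric ratio scaled by the reference and alternate strand ratios); treat it as an approximation rather than the tool's exact code. The function name strand_odds_ratio is hypothetical.

```python
from math import log


def strand_odds_ratio(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Symmetric odds ratio for strand bias; larger values mean more bias."""
    # Add a pseudocount so that empty cells do not cause division by zero.
    rf, rr, af, ar = (count + 1.0 for count in (ref_fwd, ref_rev, alt_fwd, alt_rev))
    # Sum the odds ratio and its inverse so both bias directions are treated alike.
    symmetric_ratio = (rf / rr) * (ar / af) + (rr / rf) * (af / ar)
    ref_ratio = min(rf, rr) / max(rf, rr)
    alt_ratio = min(af, ar) / max(af, ar)
    return log(symmetric_ratio * ref_ratio / alt_ratio)


if __name__ == "__main__":
    print(strand_odds_ratio(10, 10, 10, 10))  # balanced strands
    print(strand_odds_ratio(10, 10, 20, 0))   # alternate allele only on forward reads
```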