Hi all, I recently ran the GATK pipeline on a small set of WES samples. I applied hard filtering because I had only 10 samples, and then had GATK produce a metrics file. I used the default filtering parameters for the first pass, with this command:
gatk VariantFiltration \
    -V myfam_output_sorted_annotated.snps.vcf \
    -filter "QD < 2.0" --filter-name "FILTER_QD2" \
    -filter "QUAL < 30.0" --filter-name "FILTER_QUAL30" \
    -filter "SOR > 3.0" --filter-name "FILTER_SOR3" \
    -filter "FS > 60.0" --filter-name "FILTER_FS60" \
    -filter "MQ < 40.0" --filter-name "FILTER_MQ40" \
    -filter "MQRankSum < -12.5" --filter-name "FILTER_MQRS-12.5" \
    -filter "ReadPosRankSum < -8.0" --filter-name "FILTER_RPRS-8" \
    -O myfam_output_sorted_filtered.snps.vcf
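VariantFiltration only annotates the FILTER column and leaves the flagged records in the output VCF, so a tally of that column shows which of the expressions above is actually flagging variants. A minimal sketch, assuming an uncompressed VCF on stdin; the function name `filter_counts` is purely illustrative and not part of GATK:

```shell
# Tally how many records carry each FILTER value (hypothetical helper).
# Records failing several filters ("A;B") are counted once per filter name.
filter_counts() {
  awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    {
      n = split($7, names, ";")                     # FILTER is column 7
      for (i = 1; i <= n; i++) count[names[i]]++
    }
    END { for (name in count) printf "%s\t%d\n", name, count[name] }
  ' | sort -t "$(printf '\t')" -k2,2nr -k1,1        # largest count first
}
```

It could be run as `filter_counts < myfam_output_sorted_filtered.snps.vcf`. If one filter name dominates the counts, that threshold is probably the one worth inspecting first.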
After running CollectVariantCallingMetrics, the metrics file gave me these values:
TOTAL_INDELS    DBSNP_INS_DEL_RATIO    NOVEL_INS_DEL_RATIO
7902            0.811012               0.528926

TOTAL_SNPS      DBSNP_TITV             NOVEL_TITV
49556           2.286389               1.522989
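For context on the TITV columns: the Ti/Tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other base changes). A rough way to cross-check it directly from a VCF is sketched below, assuming an uncompressed VCF on stdin and counting only biallelic single-base records; the function name `titv` and its `pass_only` argument are illustrative, not a GATK or Picard feature:

```shell
# Compute Ti/Tv over biallelic SNPs in a VCF read from stdin (hypothetical helper).
# Usage: titv 1 < in.vcf   -> only records with FILTER of PASS or "."
#        titv 0 < in.vcf   -> all records, filtered or not
titv() {
  awk -v pass_only="${1:-0}" -F'\t' '
    /^#/ { next }                                          # skip header lines
    pass_only && $7 != "PASS" && $7 != "." { next }        # drop flagged records
    $4 ~ /^[ACGTacgt]$/ && $5 ~ /^[ACGTacgt]$/ {           # single-base REF and ALT
      pair = toupper($4 $5)
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
      else tv++
    }
    END {
      if (tv > 0) printf "ts=%d tv=%d titv=%.3f\n", ts, tv, ts / tv
      else        printf "ts=%d tv=%d titv=NA\n", ts, tv
    }'
}
```

Comparing `titv 1` against `titv 0` on the filtered VCF gives a quick sense of whether the hard filters move the ratio at all. Multiallelic sites and indels are skipped here, so the counts will not match the metrics file exactly.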
My questions: 1. Do I need to be concerned about these numbers? They do not fall within the ranges GATK suggests, given below:
Filtering for    Indel Ratio
common           ~1
rare             0.2-0.5

Sequencing Type    # of Variants*    TiTv Ratio
WGS                ~4.4M             2.0-2.1
WES                ~41k              3.0-3.3
2. Do I need to change the filtering parameters? It looks like I may have a high number of false positives.
3. Why is the dbSNP TiTv that low? I used the dbSNP file from GATK, extracted only the chromosomes of interest to make a subset dbSNP file, and used that for CollectVariantCallingMetrics.
4. Above all, do I even need to worry this much about these numbers? If I can view my variants in IGV alongside the BAM, and also view them in a variant viewer with ClinVar and dbSNP for those specific regions, and I see support from those databases for some of these SNPs, would that not be robust enough? My point is: how much do we rely on these numbers, and how far should we keep filtering and polishing?
Any suggestions would help. Thank you!