I have BAMs from which I am trying to detect reliable variants, to find differences between two strains of Aspergillus. I have used HaplotypeCaller to call call SNPs and small InDels, and Pindel to call longer InDels and other structural variants. I understand that filtering is always a balancing act and there may be no objectively best answer to this question, but any advice or experience anyone has would be appreciated. Also, feel free to criticise any of the background I give you!
Cobbling together ideas from here and there, I have used the following filters on my HaplotypeCaller results (incidentally, do you like them?):
--filterExpression " QD < 2.0 " --filterName "LowQD" --filterExpression " FS > 60.0 " --filterName "HighFS" --filterExpression " SOR > 4.0 " --filterName "HighSOR" --filterExpression " MQ < 40.0 " --filterName "LowMQ" --filterExpression " DP < 10 " --filterName "LowDP" --filterExpression "MQRankSum < -12.5 " --filterName "LowMQRankSum" --filterExpression "ReadPosRankSum < -8.0 " --filterName "LowReadPosRankSum"
for short InDels:
--filterExpression " QD < 2.0 " --filterName "LowQD" --filterExpression " FS > 200.0 " --filterName "HighFS" --filterExpression " SOR > 10.0 " --filterName "HighSOR" --filterExpression " DP < 10 " --filterName "LowDP" --filterExpression "ReadPosRankSum < -20.0 " --filterName "LowReadPosRankSum"
This reduces the number of short variants I’m dealing with from an already modest 2,065 to 1,965 (in one example sample).
However, I also care about larger variants, so I ran Pindel on the same BAMs. I used pretty much default parameters, and have not yet attempted to filter. I get a whopping 37,856 variants to deal with (admittedly this is across my two samples, so not totally comparable to my above figures, but we are clearly not in the same ball park). These are:
27410 Deletions 6821 Short insertions 1730 Inversions (INV) 1312 deletions with non-template sequences 583 Tandem duplications 0 Tandem duplications with non-template sequence (TD_NT) 0 Inversions with non-template sequence (INV_NT)
I guess to be meaningful and usable along side my short variants, they need some amount of filtering, but exactly how should I go about this, particularly with the range of different types of variant?
I've had a bit of a google and found several half-answers:
There seems to be talk online of just filtering by DP and sample number (eg https://www.biostars.org/p/70476/). How safe is that? I guess it still leaves me prone to many kinds of error, such as any that are introduced by mapping errors. Also I have just two samples...
I've had a quick look at AgilePindelFilter, which looks cool if you have more biological knowledge than I do. I'll let the biologists know about this when I pass back the work, but I’d like to reduce the numbers for them a bit before then, to make it something like equivalent to the HaplotypeCaller results.
There are also a couple of scripts available, without much documentation: Filter_Pindel_del_vcf.py, to filter out deletions (and insertions?) with a variant quality of <30; and Filter_Pindel_inv_vcf.py, to filter out inversions that are less than 100 bp (why do we want to do that?) and (/or?) with a variant quality of <30. do these look like good parameters? Also, what about the other variant types that Pindel reports (e.g. tandem duplication)?
Thanks very much.