The GATK documentation for best-practices genotype calling recommends filtering called variants. Usually, you should use an adaptive filter that learns from a set of known good variants. If you do not have enough data for that (or, presumably, if you do not have a database of known variants for your organism), it is recommended to use explicit filtering on certain attributes with fixed thresholds, using the VariantFiltration operation.
In these circumstances, the best practices document says:
For SNPs:
* DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0".
For Indels:
* DATA_TYPE_SPECIFIC_FILTERS should be "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0".
How is DATA_TYPE_SPECIFIC_FILTERS specified on the command line? The VariantFiltration documentation does not list it as an option. Could you give an example command showing how these recommended filters can be passed to VariantFiltration?
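For illustration, here is a minimal sketch of how such a command might be assembled. It assumes GATK 3.x-style arguments (-T VariantFiltration, --filterExpression, --filterName, -o) and treats DATA_TYPE_SPECIFIC_FILTERS as a placeholder for the quoted thresholds rather than a real option. The helper name, file names, and filter name are hypothetical.

```python
import shlex

# Thresholds quoted from the best-practices text above.
SNP_FILTERS = ["QD < 2.0", "MQ < 40.0", "FS > 60.0",
               "HaplotypeScore > 13.0", "MQRankSum < -12.5",
               "ReadPosRankSum < -8.0"]
INDEL_FILTERS = ["QD < 2.0", "ReadPosRankSum < -20.0",
                 "InbreedingCoeff < -0.8", "FS > 200.0"]


def variant_filtration_cmd(filters, filter_name, in_vcf, out_vcf,
                           reference="reference.fasta",
                           jar="GenomeAnalysisTK.jar"):
    """Build an argument list for a GATK 3.x-style VariantFiltration run.

    Assumption: the individual thresholds are OR-ed into one JEXL
    expression, so a variant is flagged if any single criterion matches.
    All paths and the filter name are hypothetical placeholders.
    """
    expression = " || ".join(filters)
    return ["java", "-jar", jar,
            "-T", "VariantFiltration",
            "-R", reference,
            "-V", in_vcf,
            "--filterExpression", expression,
            "--filterName", filter_name,
            "-o", out_vcf]


snp_cmd = variant_filtration_cmd(SNP_FILTERS, "snp_hard_filter",
                                 "raw_snps.vcf", "filtered_snps.vcf")
# Shell-quoted form, ready to paste into a terminal.
snp_cmd_line = " ".join(shlex.quote(arg) for arg in snp_cmd)
```

An alternative under the same assumptions is to pass each threshold as its own --filterExpression/--filterName pair, which records which specific criterion a variant failed.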
This clears things up significantly. However, I still can't find how to apply a filter only to SNPs or only to indels. Do I need to use the JEXL expressions
vc.getType() == Type.INDEL
and
vc.getType() == Type.SNP || vc.getType() == Type.MNP
to select the appropriate kinds of variants? Is the Type enum even accessible from JEXL? Is there a better way?
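One possible alternative to type checks inside JEXL, sketched here under the assumption of GATK 3.x-style arguments, is to split the call set by variant type with SelectVariants and its -selectType argument first, then filter each subset separately. The helper name and file names below are hypothetical.

```python
def select_variants_cmd(select_types, in_vcf, out_vcf,
                        reference="reference.fasta",
                        jar="GenomeAnalysisTK.jar"):
    """Build an argument list for a GATK 3.x-style SelectVariants run.

    Assumption: -selectType may be repeated, once per variant type to keep.
    All paths are hypothetical placeholders.
    """
    cmd = ["java", "-jar", jar,
           "-T", "SelectVariants",
           "-R", reference,
           "-V", in_vcf]
    for variant_type in select_types:
        cmd += ["-selectType", variant_type]
    cmd += ["-o", out_vcf]
    return cmd


# One subset for SNPs (and MNPs), one for indels; each subset can then be
# run through VariantFiltration with its own thresholds.
snp_select = select_variants_cmd(["SNP", "MNP"],
                                 "raw_variants.vcf", "raw_snps.vcf")
indel_select = select_variants_cmd(["INDEL"],
                                   "raw_variants.vcf", "raw_indels.vcf")
```

Splitting first keeps the filter expressions themselves free of type logic, at the cost of producing two intermediate VCF files.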
EDIT: Also, the overall semantics are still unspecified. Do I want to remove variants matching any of the criteria? Remove those matching all the criteria? Keep only those that match at least one criterion? Or keep only those that match all the criteria? I would guess I'm supposed to remove variants that match any of the criteria for their particular variant type, but I'm not sure.
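To make the "matches any criterion" reading concrete, here is a small sketch that models it in plain Python. It is not GATK code. It assumes that a variant is flagged (marked in the FILTER column rather than deleted from the file) as soon as one threshold expression is true, and that a missing annotation does not count as a failure. The function names and the example annotation values are made up for illustration.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# SNP thresholds quoted from the best-practices text above.
SNP_CRITERIA = [("QD", "<", 2.0), ("MQ", "<", 40.0), ("FS", ">", 60.0),
                ("HaplotypeScore", ">", 13.0), ("MQRankSum", "<", -12.5),
                ("ReadPosRankSum", "<", -8.0)]


def failed_criteria(info, criteria):
    """Return the criteria that are true for this variant's annotations."""
    failed = []
    for key, op, threshold in criteria:
        value = info.get(key)
        if value is None:
            # Assumption: an absent annotation is skipped, not failed.
            continue
        if OPS[op](value, threshold):
            failed.append(f"{key} {op} {threshold}")
    return failed


def filter_status(info, criteria):
    """'PASS' if no criterion matches, otherwise 'FILTERED'."""
    return "FILTERED" if failed_criteria(info, criteria) else "PASS"


# Hypothetical annotation values for two SNP records.
good_snp = {"QD": 25.0, "MQ": 60.0, "FS": 1.2}
bad_snp = {"QD": 25.0, "MQ": 60.0, "FS": 75.0}
```

Under these assumptions, good_snp passes, while bad_snp is flagged because FS exceeds 60.0 even though its other annotations look fine.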
Sorry to reply so late -- for best support, I recommend posting your question directly on the GATK forum: http://gatkforums.broadinstitute.org/