As one of the last steps of a large variant-analysis pipeline, I would like to identify no-call sites and pass them to GATK via --snpmask, so that sites that cannot be confidently called are masked in the output. Not terribly tricky.
I decided the best way to go about this was to generate calls at all sites using EMIT_ALL_SITES in GATK. I then use a short Perl one-liner to identify sites with no-calls (./.), subset the VCF so that it includes only those sites, and pass the result back to GATK one last time with --snpmask. The whole pipeline from raw data to sequences, including a number of steps I won't go into here, is currently implemented in Python.
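Since the surrounding pipeline is already in Python, the no-call scan could also be done there with a streaming pass over the VCF. This is a minimal sketch, not the Perl one-liner from the post; the function name is mine, and it assumes a single-sample VCF in which GT is the first key in the FORMAT column:

```python
# Hypothetical sketch: stream an EMIT_ALL_SITES VCF and write a
# chrom/pos file listing every site whose genotype is a no-call (./.).
# Assumes a single-sample VCF with GT as the first FORMAT key.
def write_nocall_positions(vcf_path, out_path):
    with open(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):      # skip meta and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos = fields[0], fields[1]
            sample = fields[9]            # first (only) sample column
            gt = sample.split(":")[0]     # GT is the first FORMAT field
            if gt in ("./.", "."):
                out.write(f"{chrom}\t{pos}\n")
```

Because it reads line by line, memory stays flat regardless of VCF size, which matters for the resource problems described below.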
I was filtering the VCF with VCFtools, passing it the positions file generated above. However, the VCFtools call takes over 100 GB of RAM, and I would like to process multiple samples simultaneously. The CPU time is also considerable; this is one of the most resource-intensive steps, outpacing even indel realignment and variant calling.
My question: given these constraints, is there a better way to subset the VCF? Should I try creating a BED file from my list of positions (it would be something like chr1 0 1; chr1 1 2 for many of the sites) and letting tabix take a crack at it? Are there other options I ought to consider? An approach that takes a bit longer but has a smaller memory footprint would be fine.
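For the BED idea, the conversion itself is trivial: VCF positions are 1-based, BED intervals are 0-based and half-open, so each position P becomes the interval (P-1, P). A sketch (function name is mine):

```python
# Hypothetical sketch: convert a 1-based "chrom<TAB>pos" positions file
# into a BED file (0-based, half-open intervals) for use as a mask.
def positions_to_bed(pos_path, bed_path):
    with open(pos_path) as pos, open(bed_path, "w") as bed:
        for line in pos:
            chrom, p = line.split()
            start = int(p) - 1            # BED start is 0-based
            bed.write(f"{chrom}\t{start}\t{p}\n")
```

The resulting BED could then be used against a bgzip-compressed, tabix-indexed VCF (e.g. via tabix's regions-file option), which streams only the matching records rather than holding the whole file in memory, though I haven't benchmarked it against the VCFtools approach.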