22 months ago by
Spain, Barcelona, CRG
After much searching and discussions with colleagues there are several solutions that seem to work here.
Solution one - pure bcftools:
This is a bit of an indirect solution, as it relies on the AC (allele count) field to "guess" if there are ANY genotypes other than 0/0 for a given variant - it uses --min-ac option set to 1, which means that a variant has to have at least 1 non-reference allele to be retained.
--no-update parameter is optional, as I'd like to keep the INFO field unchanged. Incidentally, this is also the fasted solution as tested on 10K variants.
time bcftools view --samples-file samples.to.keep.txt --min-ac=1 --no-update test.vcf.gz > subset.bcftools.vcf
Solution two - pure vcftools
Turns out that vcftools has exactly the parameter for this job: --non-ref-ac-any. So if you set it to 1, and at least one sample has a non-reference allele, the variant will be retained. I guess 1 can be substituted with a range, but I haven't tested it extensively. --recode makes the command to output a vcf file and --recode-INFO-all keeps the INFO field unchanged. This solutions is substantially slower than bcftools above.
time vcftools --gzvcf test.vcf.gz --keep samples.to.keep.txt --non-ref-ac-any 1 --recode --recode-INFO-all --stdout > subset.vcftools.vcf
Solution three - vcftools + grep
This is more for completeness sake, as the two above solutions are perfectly adequate. But it might be useful for starting with an already sample-subsetted vcf file. It's a two-step solution: first I sub-select the required samples using vcftools (I could equally use bcftools here). The result is then piped to grep (or egrep in this case) that collects the lines that have at least one genotype that is not "0/0". There are four conditions for the grep command that cover all possible genotypes plus the '#' for the header lines. This is slightly slower than the pure vcf solution.
time vcftools --gzvcf test.vcf --keep samples.to.keep.txt --recode --recode-INFO-all --stdout | egrep "0[/|][1-9]|[1-9][/|]0|[1-9][/|][1-9]|#" > subset.grep.vcf