Remove positions that are non-variant in a subset of samples from a vcf file
Entering edit mode
6.9 years ago

Hi all,

I have a multi-sample vcf file and I would like to produce a new vcf file containing a subset of samples, while excluding the variants that don't have any alternative allele genotypes in that subset of samples (i.e. exclude all the variants with genotype 0/0). I can use vcf tools to extract the samples of interest and I can also extract the genotypes for these samples. But I can't seem to find a filtering option (if there is one) that would allow me to exclude variants with all 0/0 genotypes.

Is there a standard way of doing this, or would I need to concoct something custom?

Many thanks!

vcf vcftools genotype filtering bcftools • 14k views
Entering edit mode
6.8 years ago

After much searching and discussions with colleagues there are several solutions that seem to work here.

Solution one - pure bcftools:

This is a bit of an indirect solution, as it relies on the AC (allele count) field to "guess" if there are ANY genotypes other than 0/0 for a given variant - it uses --min-ac option set to 1, which means that a variant has to have at least 1 non-reference allele to be retained.

--no-update parameter is optional, as I'd like to keep the INFO field unchanged. Incidentally, this is also the fasted solution as tested on 10K variants.

time bcftools view --samples-file --min-ac=1 --no-update test.vcf.gz > subset.bcftools.vcf
real    0m16.232s
user    0m15.848s
sys     0m0.060s

Solution two - pure vcftools

Turns out that vcftools has exactly the parameter for this job: --non-ref-ac-any. So if you set it to 1, and at least one sample has a non-reference allele, the variant will be retained. I guess 1 can be substituted with a range, but I haven't tested it extensively. --recode makes the command to output a vcf file and --recode-INFO-all keeps the INFO field unchanged. This solutions is substantially slower than bcftools above.

time vcftools --gzvcf test.vcf.gz --keep --non-ref-ac-any 1 --recode --recode-INFO-all --stdout > subset.vcftools.vcf
real    0m52.590s
user    0m52.309s
sys     0m0.033s

Solution three - vcftools + grep

This is more for completeness sake, as the two above solutions are perfectly adequate. But it might be useful for starting with an already sample-subsetted vcf file. It's a two-step solution: first I sub-select the required samples using vcftools (I could equally use bcftools here). The result is then piped to grep (or egrep in this case) that collects the lines that have at least one genotype that is not "0/0". There are four conditions for the grep command that cover all possible genotypes plus the '#' for the header lines. This is slightly slower than the pure vcf solution.

time vcftools --gzvcf test.vcf --keep --recode --recode-INFO-all --stdout | egrep "0[/|][1-9]|[1-9][/|]0|[1-9][/|][1-9]|#" > subset.grep.vcf
real    1m4.191s
user    1m3.991s
sys     0m0.056s
Entering edit mode
6.8 years ago
Chris Whelan ▴ 570

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

Entering edit mode

Thank you Chris. I'll give it a try, once I get GATK to run.

Entering edit mode
6.8 years ago

GATK SelectVariants


  • -sn (sample name) option
  • -removeUnusedAlternates .
  • -selectexpressions 'vc.getAlternateAlleles().isEmpty())

(not tested)

Entering edit mode

Thanks! I had the suggestion for GATK SelectVariants from somebody else as well, but wasn't able to get past making it to run :( So still not tested, but it looks promising.

Entering edit mode
5.3 years ago
maduh17 ▴ 10

You can also use awk, but be careful which columns:

cat raw_variants.vcf | \
awk '($0~/^#/)($0!~/^#/){split($10,x,":"); if(x[1]!="0/0") print }' > hard_filtered.vcf

Login before adding your answer.

Traffic: 1073 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6