Question: Remove positions that are non-variant in a subset of samples from a vcf file
5
gravatar for Aliaksei Holik
2.3 years ago by
Spain, Barcelona, CRG
Aliaksei Holik170 wrote:

Hi all,

I have a multi-sample vcf file and I would like to produce a new vcf file containing a subset of samples, while excluding the variants that don't have any alternative allele genotypes in that subset of samples (i.e. exclude all the variants with genotype 0/0). I can use vcf tools to extract the samples of interest and I can also extract the genotypes for these samples. But I can't seem to find a filtering option (if there is one) that would allow me to exclude variants with all 0/0 genotypes.

Is there a standard way of doing this, or would I need to concoct something custom?

Many thanks!

ADD COMMENTlink modified 9 months ago by maduh1710 • written 2.3 years ago by Aliaksei Holik170
12
gravatar for Aliaksei Holik
2.3 years ago by
Spain, Barcelona, CRG
Aliaksei Holik170 wrote:

After much searching and discussions with colleagues there are several solutions that seem to work here.

Solution one - pure bcftools:

This is a bit of an indirect solution, as it relies on the AC (allele count) field to "guess" if there are ANY genotypes other than 0/0 for a given variant - it uses --min-ac option set to 1, which means that a variant has to have at least 1 non-reference allele to be retained.

--no-update parameter is optional, as I'd like to keep the INFO field unchanged. Incidentally, this is also the fasted solution as tested on 10K variants.

time bcftools view --samples-file samples.to.keep.txt --min-ac=1 --no-update test.vcf.gz > subset.bcftools.vcf
real    0m16.232s
user    0m15.848s
sys     0m0.060s

Solution two - pure vcftools

Turns out that vcftools has exactly the parameter for this job: --non-ref-ac-any. So if you set it to 1, and at least one sample has a non-reference allele, the variant will be retained. I guess 1 can be substituted with a range, but I haven't tested it extensively. --recode makes the command to output a vcf file and --recode-INFO-all keeps the INFO field unchanged. This solutions is substantially slower than bcftools above.

time vcftools --gzvcf test.vcf.gz --keep samples.to.keep.txt --non-ref-ac-any 1 --recode --recode-INFO-all --stdout > subset.vcftools.vcf
real    0m52.590s
user    0m52.309s
sys     0m0.033s

Solution three - vcftools + grep

This is more for completeness sake, as the two above solutions are perfectly adequate. But it might be useful for starting with an already sample-subsetted vcf file. It's a two-step solution: first I sub-select the required samples using vcftools (I could equally use bcftools here). The result is then piped to grep (or egrep in this case) that collects the lines that have at least one genotype that is not "0/0". There are four conditions for the grep command that cover all possible genotypes plus the '#' for the header lines. This is slightly slower than the pure vcf solution.

time vcftools --gzvcf test.vcf --keep samples.to.keep.txt --recode --recode-INFO-all --stdout | egrep "0[/|][1-9]|[1-9][/|]0|[1-9][/|][1-9]|#" > subset.grep.vcf
real    1m4.191s
user    1m3.991s
sys     0m0.056s
ADD COMMENTlink written 2.3 years ago by Aliaksei Holik170
1
gravatar for Chris Whelan
2.3 years ago by
Chris Whelan540
Portland, OR
Chris Whelan540 wrote:

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

ADD COMMENTlink written 2.3 years ago by Chris Whelan540

Thank you Chris. I'll give it a try, once I get GATK to run.

ADD REPLYlink written 2.3 years ago by Aliaksei Holik170
1
gravatar for Chris Whelan
2.3 years ago by
Chris Whelan540
Portland, OR
Chris Whelan540 wrote:

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

ADD COMMENTlink written 2.3 years ago by Chris Whelan540
0
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

GATK SelectVariants https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

with

  • -sn (sample name) option
  • -removeUnusedAlternates .
  • -selectexpressions 'vc.getAlternateAlleles().isEmpty())

(not tested)

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Pierre Lindenbaum114k

Thanks! I had the suggestion for GATK SelectVariants from somebody else as well, but wasn't able to get past making it to run :( So still not tested, but it looks promising.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Aliaksei Holik170
0
gravatar for maduh17
9 months ago by
maduh1710
Kuala Lumpur, Malaysia
maduh1710 wrote:

You can also use awk, but be careful which columns:

cat raw_variants.vcf | \
awk '($0~/^#/)($0!~/^#/){split($10,x,":"); if(x[1]!="0/0") print }' > hard_filtered.vcf
ADD COMMENTlink written 9 months ago by maduh1710
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1924 users visited in the last hour