Question: Remove positions that are non-variant in a subset of samples from a vcf file
3
gravatar for Aliaksei Holik
19 months ago by
Spain, Barcelona, CRG
Aliaksei Holik80 wrote:

Hi all,

I have a multi-sample vcf file and I would like to produce a new vcf file containing a subset of samples, while excluding the variants that don't have any alternative allele genotypes in that subset of samples (i.e. exclude all the variants with genotype 0/0). I can use vcf tools to extract the samples of interest and I can also extract the genotypes for these samples. But I can't seem to find a filtering option (if there is one) that would allow me to exclude variants with all 0/0 genotypes.

Is there a standard way of doing this, or would I need to concoct something custom?

Many thanks!

ADD COMMENTlink modified 9 days ago by maduh1710 • written 19 months ago by Aliaksei Holik80
5
gravatar for Aliaksei Holik
19 months ago by
Spain, Barcelona, CRG
Aliaksei Holik80 wrote:

After much searching and discussions with colleagues there are several solutions that seem to work here.

Solution one - pure bcftools:

This is a bit of an indirect solution, as it relies on the AC (allele count) field to "guess" if there are ANY genotypes other than 0/0 for a given variant - it uses --min-ac option set to 1, which means that a variant has to have at least 1 non-reference allele to be retained.

--no-update parameter is optional, as I'd like to keep the INFO field unchanged. Incidentally, this is also the fasted solution as tested on 10K variants.

time bcftools view --samples-file samples.to.keep.txt --min-ac=1 --no-update test.vcf.gz > subset.bcftools.vcf
real    0m16.232s
user    0m15.848s
sys     0m0.060s

Solution two - pure vcftools

Turns out that vcftools has exactly the parameter for this job: --non-ref-ac-any. So if you set it to 1, and at least one sample has a non-reference allele, the variant will be retained. I guess 1 can be substituted with a range, but I haven't tested it extensively. --recode makes the command to output a vcf file and --recode-INFO-all keeps the INFO field unchanged. This solutions is substantially slower than bcftools above.

time vcftools --gzvcf test.vcf.gz --keep samples.to.keep.txt --non-ref-ac-any 1 --recode --recode-INFO-all --stdout > subset.vcftools.vcf
real    0m52.590s
user    0m52.309s
sys     0m0.033s

Solution three - vcftools + grep

This is more for completeness sake, as the two above solutions are perfectly adequate. But it might be useful for starting with an already sample-subsetted vcf file. It's a two-step solution: first I sub-select the required samples using vcftools (I could equally use bcftools here). The result is then piped to grep (or egrep in this case) that collects the lines that have at least one genotype that is not "0/0". There are four conditions for the grep command that cover all possible genotypes plus the '#' for the header lines. This is slightly slower than the pure vcf solution.

time vcftools --gzvcf test.vcf --keep samples.to.keep.txt --recode --recode-INFO-all --stdout | egrep "0[/|][1-9]|[1-9][/|]0|[1-9][/|][1-9]|#" > subset.grep.vcf
real    1m4.191s
user    1m3.991s
sys     0m0.056s
ADD COMMENTlink written 19 months ago by Aliaksei Holik80
1
gravatar for Chris Whelan
19 months ago by
Chris Whelan540
Portland, OR
Chris Whelan540 wrote:

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

ADD COMMENTlink written 19 months ago by Chris Whelan540

Thank you Chris. I'll give it a try, once I get GATK to run.

ADD REPLYlink written 19 months ago by Aliaksei Holik80
1
gravatar for Chris Whelan
19 months ago by
Chris Whelan540
Portland, OR
Chris Whelan540 wrote:

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

ADD COMMENTlink written 19 months ago by Chris Whelan540
0
gravatar for Pierre Lindenbaum
19 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum104k wrote:

GATK SelectVariants https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

with

  • -sn (sample name) option
  • -removeUnusedAlternates .
  • -selectexpressions 'vc.getAlternateAlleles().isEmpty())

(not tested)

ADD COMMENTlink modified 19 months ago • written 19 months ago by Pierre Lindenbaum104k

Thanks! I had the suggestion for GATK SelectVariants from somebody else as well, but wasn't able to get past making it to run :( So still not tested, but it looks promising.

ADD REPLYlink modified 19 months ago • written 19 months ago by Aliaksei Holik80
0
gravatar for maduh17
9 days ago by
maduh1710
Kuala Lumpur, Malaysia
maduh1710 wrote:

You can also use awk, but be careful which columns:

cat raw_variants.vcf | \
awk '($0~/^#/)($0!~/^#/){split($10,x,":"); if(x[1]!="0/0") print }' > hard_filtered.vcf
ADD COMMENTlink written 9 days ago by maduh1710
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1243 users visited in the last hour