Remove positions that are non-variant in a subset of samples from a vcf file
5
8
Entering edit mode
4.7 years ago

Hi all,

I have a multi-sample vcf file and I would like to produce a new vcf file containing a subset of samples, while excluding the variants that don't have any alternative allele genotypes in that subset of samples (i.e. exclude all the variants with genotype 0/0). I can use vcf tools to extract the samples of interest and I can also extract the genotypes for these samples. But I can't seem to find a filtering option (if there is one) that would allow me to exclude variants with all 0/0 genotypes.

Is there a standard way of doing this, or would I need to concoct something custom?

Many thanks!

vcf vcftools genotype filtering bcftools • 8.7k views
23
Entering edit mode
4.7 years ago

After much searching and discussions with colleagues there are several solutions that seem to work here.

## Solution one - pure bcftools:

This is a bit of an indirect solution, as it relies on the AC (allele count) field to "guess" if there are ANY genotypes other than 0/0 for a given variant - it uses --min-ac option set to 1, which means that a variant has to have at least 1 non-reference allele to be retained.

--no-update parameter is optional, as I'd like to keep the INFO field unchanged. Incidentally, this is also the fasted solution as tested on 10K variants.

time bcftools view --samples-file samples.to.keep.txt --min-ac=1 --no-update test.vcf.gz > subset.bcftools.vcf
real    0m16.232s
user    0m15.848s
sys     0m0.060s


## Solution two - pure vcftools

Turns out that vcftools has exactly the parameter for this job: --non-ref-ac-any. So if you set it to 1, and at least one sample has a non-reference allele, the variant will be retained. I guess 1 can be substituted with a range, but I haven't tested it extensively. --recode makes the command to output a vcf file and --recode-INFO-all keeps the INFO field unchanged. This solutions is substantially slower than bcftools above.

time vcftools --gzvcf test.vcf.gz --keep samples.to.keep.txt --non-ref-ac-any 1 --recode --recode-INFO-all --stdout > subset.vcftools.vcf
real    0m52.590s
user    0m52.309s
sys     0m0.033s


## Solution three - vcftools + grep

This is more for completeness sake, as the two above solutions are perfectly adequate. But it might be useful for starting with an already sample-subsetted vcf file. It's a two-step solution: first I sub-select the required samples using vcftools (I could equally use bcftools here). The result is then piped to grep (or egrep in this case) that collects the lines that have at least one genotype that is not "0/0". There are four conditions for the grep command that cover all possible genotypes plus the '#' for the header lines. This is slightly slower than the pure vcf solution.

time vcftools --gzvcf test.vcf --keep samples.to.keep.txt --recode --recode-INFO-all --stdout | egrep "0[/|][1-9]|[1-9][/|]0|[1-9][/|][1-9]|#" > subset.grep.vcf
real    1m4.191s
user    1m3.991s
sys     0m0.056s

2
Entering edit mode
4.7 years ago
Chris Whelan ▴ 550

You should be able to just use the --excludeNonVariants flag for GATK SelectVariants.

0
Entering edit mode

Thank you Chris. I'll give it a try, once I get GATK to run.

0
Entering edit mode
4.7 years ago

with

• -sn (sample name) option
• -removeUnusedAlternates .
• -selectexpressions 'vc.getAlternateAlleles().isEmpty())

(not tested)

0
Entering edit mode

Thanks! I had the suggestion for GATK SelectVariants from somebody else as well, but wasn't able to get past making it to run :( So still not tested, but it looks promising.

0
Entering edit mode
3.2 years ago

You can also use awk, but be careful which columns:

cat raw_variants.vcf | \
awk '($0~/^#/)($0!~/^#/){split(\$10,x,":"); if(x[1]!="0/0") print }' > hard_filtered.vcf