Why is VCFtools excluding invariable sites?
3.0 years ago
raf.marcondes ▴ 100

I have a VCF file with all genomic sites, including invariant ones, which I generated with the option --ALL-SITES in GATK GenotypeGVCFs. I want to filter that file for quality, coverage, etc., but the VCFtools command below excludes all invariant sites. Why? It doesn't seem like it should. I want to keep them (if they pass the filters).

module load vcftools/0.1.14/INTEL-18.0.0

vcftools --gzvcf raw_unfiltered_ALLSITES.vcf --max-missing 0.5 --minQ 30 --minDP 5 --recode --recode-INFO-all --out temp

Consider using bcftools instead. VCFtools has not been updated in a long time and, according to its author, will not be. bcftools behaves more sensibly than VCFtools.


Could you say more about why you want to keep invariant sites in a variant call file? I think VCFs and VCFtools were designed on the assumption that these files describe variants.


I too came across this thread looking for the same answer. Why: basically, pixy (https://pixy.readthedocs.io/en/latest/about.html) requires invariant sites (https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html#generating-allsites-vcfs-using-gatk), and once you have the output file, it needs to be filtered. (Yes, I could have, and should have, filtered all the g.vcf files first.) But the GATK GenomicsDBImport step took weeks to run, and I don't want to have to go back a step.

I will look into bcftools as an alternative.


It seems "--minQ 30" filtered out most of your invariant sites. Quality has a different meaning for invariant sites than it does for variant sites. You should follow the pixy guide: separate the all-sites VCF into invariant.vcf and variant.vcf and filter them separately.
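One way to check this on your own file is to tally records by site class before filtering. Below is a minimal sketch using only awk. It assumes invariant records have "." in the ALT column (some GATK outputs use <NON_REF> instead, so check your file first), and it counts a missing QUAL as failing the threshold, which is an assumption about how your filter treats it. The function name count_qual_by_class and the toy records are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: tally records that pass or fail a QUAL threshold,
# split into invariant (ALT is ".") and variant sites.
# Reads uncompressed VCF text on stdin; $1 is the minimum QUAL.
count_qual_by_class() {
  awk -F'\t' -v minq="$1" '
    /^#/ { next }                                   # skip header lines
    {
      cls = ($5 == ".") ? "invariant" : "variant"
      # assumption: a missing QUAL (".") is treated as failing
      st  = ($6 != "." && $6 + 0 >= minq) ? "pass" : "fail"
      n[cls "\t" st]++
    }
    END {
      split("invariant variant", classes, " ")
      split("pass fail", statuses, " ")
      for (i = 1; i <= 2; i++)
        for (j = 1; j <= 2; j++)
          printf "%s\t%s\t%d\n", classes[i], statuses[j], n[classes[i] "\t" statuses[j]]
    }'
}

# Toy records for illustration only (not taken from the original post)
toy_vcf=$(printf '%b\n' \
  '##fileformat=VCFv4.2' \
  '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' \
  'chr1\t100\t.\tA\t.\t.\t.\tDP=20' \
  'chr1\t101\t.\tC\t.\t12.5\t.\tDP=18' \
  'chr1\t102\t.\tG\tT\t250\t.\tDP=22' \
  'chr1\t103\t.\tT\tC\t15\t.\tDP=9')

qual_summary=$(printf '%s\n' "$toy_vcf" | count_qual_by_class 30)
printf '%s\n' "$qual_summary"
```

On a real file you would pipe the decompressed VCF into the function instead, for example `bgzip -dc your_allsites.vcf.gz | count_qual_by_class 30`. If nearly all invariant records land in the "fail" row, the QUAL threshold is what is removing them.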

2.2 years ago
Peter ▴ 10

Found the solution here: https://pixy.readthedocs.io/en/latest/guide/pixy_guide.html

Copied and pasted from that page:

If your VCF contains both variant and invariant sites (as it should at this point), applying population genetic based filters will result in the loss of your invariant sites. To avoid this, filter the invariant and variant sites separately and concatenate the two resulting files. Below is an example of one way to achieve this using VCFtools and BCFtools:

#!/bin/bash
# requires bcftools/bgzip/tabix and vcftools

# create a filtered VCF containing only invariant sites
vcftools --gzvcf test.vcf.gz \
--max-maf 0 \
[add other filters for invariant sites here] \
--recode --stdout | bgzip -c > test_invariant.vcf.gz

# create a filtered VCF containing only variant sites
vcftools --gzvcf test.vcf.gz \
--mac 1 \
[add other filters for variant sites here] \
--recode --stdout | bgzip -c > test_variant.vcf.gz

# index both vcfs using tabix
tabix test_invariant.vcf.gz
tabix test_variant.vcf.gz

# combine the two VCFs using bcftools concat
bcftools concat \
--allow-overlaps \
test_variant.vcf.gz test_invariant.vcf.gz \
-O z -o test_filtered.vcf.gz
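
After concatenating, it may be worth confirming that the invariant sites actually survived and that no record was written twice. The following is a rough sketch with awk, not part of the pixy guide. It assumes invariant records have "." in the ALT column; the function name summarize_sites and the toy records are hypothetical. Note that some files legitimately carry several records at one position (for example, an indel and a SNP), so a nonzero duplicate count is a prompt to look closer rather than proof of an error.

```shell
#!/bin/bash
# Hypothetical helper: count invariant and variant records in a VCF stream
# and flag CHROM:POS pairs that appear more than once.
summarize_sites() {
  awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    {
      if ($5 == ".") invariant++; else variant++    # ALT "." means invariant here
      if (seen[$1 ":" $2]++) duplicates++           # position already seen
    }
    END {
      printf "invariant=%d variant=%d duplicate_positions=%d\n",
             invariant, variant, duplicates
    }'
}

# Toy records for illustration only
toy_filtered=$(printf '%b\n' \
  '##fileformat=VCFv4.2' \
  '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' \
  'chr1\t100\t.\tA\t.\t.\t.\t.' \
  'chr1\t101\t.\tC\tT\t90\t.\t.' \
  'chr1\t101\t.\tC\tT\t90\t.\t.' \
  'chr1\t102\t.\tG\t.\t.\t.\t.')

site_summary=$(printf '%s\n' "$toy_filtered" | summarize_sites)
printf '%s\n' "$site_summary"
```

For the file produced above, the equivalent call would be `bgzip -dc test_filtered.vcf.gz | summarize_sites`.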