I have a VCF file and I want to remove sites that are not polymorphic sites i.e. sites where all individuals are REF/REF or ALT/ALT. Is there a quick way to do this using bcftools or vcftools?
Thanks.
I have a VCF file and I want to remove sites that are not polymorphic sites i.e. sites where all individuals are REF/REF or ALT/ALT. Is there a quick way to do this using bcftools or vcftools?
Thanks.
Now I hopefully understand your goal. This will remove side where all samples are either hom ref or hom alt:
$ bcftools view -e 'COUNT(GT="AA")=N_SAMPLES || COUNT(GT="RR")=N_SAMPLES' input.vcf
Have a look at the manual to explore more ways to filter vcf files using bcftools.
I've deleted my old answer, because it hasn't answer your question.
You can also use the -c function in view
-c, --min-ac INT[:nref|:alt1|:minor|:major|:'nonmajor']
minimum allele count (INFO/AC) of sites to be printed. Specifying the type of allele is optional and can be set to non-reference (nref, the default), 1st alternate (alt1), the least frequent (minor), the most frequent (major) or sum of all but the most frequent (nonmajor) alleles.
By setting -c 1, you will only keep sites with at least one nonref allele (=polymorphic in your data)
bcftools view -c 1 input.vcf.gz -o output.vcf.gz -Oz
using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html
java -jar dist/vcffilterjdk.jar -e 'return !(variant.getGenotypes().stream().allMatch(G->G.isHomVar()) || variant.getGenotypes().stream().allMatch(G->G.isHomRef()));' input.vcf
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks! Yes, I think that answers my question. I got this error message though:
the tag "INFO/COUNT" is not defined in the VCF header. Do you know how to add this info on my vcf?I think you're using an outdated version of
bcftools. The methodCOUNTwas introduced in v1.7. The current version is v1.9.Which version are you using? Please upgrade.
Using v1.9 solved it - Thanks!
@finswimmer have a quick question, will bcftools get the COUNT of the AA or RR genotypes based on the INFO field? Or will it compute the COUNT for all sites? I am applied this filter after merging my dataset with another dataset and I think I am getting weird results.
For each line in the vcf file
bcftoolswill have a look at the genotype of each sample. In the INFO column there is no information about the counts.Why do you think you are getting a weird result? As always, a small example dataset which shows your problem would be helpful.
The command will not work if there are missing genotypes in the given position.