Getting frequency of sites fixed within the sample (i.e. divergence sites) from VCF file
3.7 years ago
JGuVa ▴ 10

Hi there,

I am trying to extract fixed sites within the sample from a VCF file. By fixed sites, I mean those that differ from the reference genome but that are fixed within the sample.

site  REF  ALT  ind_1  ind_2  ind_3
1     A    C    1/1    1/1    1/1
2     G    T    1/1    0/1    0/0
3     C    G    1/1    1/1    1/0
4     G    C    0/1    1/1    1/0
5     A    G    1/1    1/1    1/1


For instance, this is a simplified version of a VCF file. In this case, sites 1 and 5 belong to the category of sites that contribute to divergence. Is there a tool in vcftools or an R package that I can use for this purpose?
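
To make the criterion concrete, here is a minimal Python sketch applied to the toy table only. The names `toy_sites` and `is_fixed_alt` are made up for illustration, and the genotypes are assumed to be unphased diploid strings exactly as written above.

```python
# Toy genotype table from the question: site -> genotypes of ind_1..ind_3.
toy_sites = {
    1: ["1/1", "1/1", "1/1"],
    2: ["1/1", "0/1", "0/0"],
    3: ["1/1", "1/1", "1/0"],
    4: ["0/1", "1/1", "1/0"],
    5: ["1/1", "1/1", "1/1"],
}


def is_fixed_alt(genotypes):
    """True when every sample is homozygous for the ALT allele (1/1)."""
    return all(gt == "1/1" for gt in genotypes)


# Sites where the whole sample differs from the reference.
fixed = [site for site, gts in toy_sites.items() if is_fixed_alt(gts)]

print(fixed)                        # [1, 5]
print(len(fixed) / len(toy_sites))  # 0.4, the fraction of fixed sites
```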

Not clear to me. You want the variants where all the genotypes are homozygous for the ALT allele?

Yes, exactly, that is what I need.

In addition to Pierre's question: is your data in this simplified format or a normal VCF?

My file is a normal VCF; I presented it like that just for the sake of the explanation.

3.7 years ago

Using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->G.isHomVar());' in.vcf

3.7 years ago

Using bcftools:

$ bcftools view -i 'COUNT(GT="AA")=N_SAMPLES' input.vcf

or

$ bcftools view -e 'GT[*]!="AA"' input.vcf
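
For the frequency asked about in the title, the number of fixed sites still has to be divided by the total number of sites. Below is a minimal pure-Python sketch of that count, not part of bcftools or jvarkit. It assumes an uncompressed, diploid VCF whose FORMAT column contains GT. The function names and the demo records (chr1, positions 1 to 5) are hypothetical. Sites with a missing genotype (./.) are treated as not fixed, which is also an assumption.

```python
import io


def is_hom_alt(gt_field):
    """True for a diploid genotype carrying the same non-reference allele twice (1/1, 1|1)."""
    alleles = gt_field.replace("|", "/").split("/")
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in (".", "0")  # assumption: missing calls are not fixed
    )


def count_fixed_sites(handle):
    """Return (fixed, total) over the data lines of an uncompressed VCF stream."""
    fixed = total = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the header line and blank lines
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
        total += 1
        if genotypes and all(is_hom_alt(gt) for gt in genotypes):
            fixed += 1
    return fixed, total


# Hypothetical demo records mirroring the toy table in the question.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_1\tind_2\tind_3\n"
demo_vcf = (
    "##fileformat=VCFv4.2\n" + header
    + "chr1\t1\t.\tA\tC\t.\t.\t.\tGT\t1/1\t1/1\t1/1\n"
    + "chr1\t2\t.\tG\tT\t.\t.\t.\tGT\t1/1\t0/1\t0/0\n"
    + "chr1\t3\t.\tC\tG\t.\t.\t.\tGT\t1/1\t1/1\t1/0\n"
    + "chr1\t4\t.\tG\tC\t.\t.\t.\tGT\t0/1\t1/1\t1/0\n"
    + "chr1\t5\t.\tA\tG\t.\t.\t.\tGT\t1/1\t1/1\t1/1\n"
)

fixed, total = count_fixed_sites(io.StringIO(demo_vcf))
print(fixed, total, fixed / total)  # 2 5 0.4
```

For a real file, pass an open file handle instead of the StringIO object, e.g. open("input.vcf"); for a compressed file, gzip.open("input.vcf.gz", "rt") from the standard library should work the same way.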


fin swimmer