Genotype Frequencies Calculation
8.2 years ago
Peixe ▴ 640

Hi there,

Could anyone show me a straightforward method to retrieve genotypic frequencies from a tped or vcf?

I mean the observed genotypic frequencies, not the expected frequencies calculated from the allelic frequencies under Hardy-Weinberg equilibrium. I found a way using plink's --hardy option, which reports the genotype counts (among many other things), and the frequencies can be derived from these counts. But I was wondering whether there is a simpler way, analogous to the --freq option that plink and vcftools offer for allelic frequencies. I know this may be a silly question, but I have not found anything.

P.

genotype vcf plink • 6.2k views
I don't understand the problem; please clarify. Wouldn't --freq with --counts give you counts? Also, try the --model option; it gives all sorts of counts, too.

I think it is clear enough... It is simply (to) "retrieve genotypic frequencies from a tped or vcf". Not genotype counts, but frequency numbers directly. I was just wondering if there was a method to retrieve it in the same format as when you retrieve the allele frequencies with vcftools or plink. That's all...

8.2 years ago
zx8754 10k

The following one-liner converts the Plink --hardy output (the .hwe file) from counts to frequencies:

#remove header, replace "/" with tabs, calculate frequencies (NA if a row has no genotype counts), write to a new file
sed 1d myfile.hwe | \
tr '/' '\t' | \
awk 'BEGIN{OFS="\t"} {n=$6+$7+$8; g=(n>0) ? ($6/n)"/"($7/n)"/"($8/n) : "NA"; print $1,$2,$3,$4,$5,g,$9,$10,$11}' \
> myfile.hwe.freq


Example:

#input
CHR         SNP     TEST   A1   A2                 GENO   O(HET)   E(HET)            P
22   rs2027653      ALL    C    T        489/1585/1498   0.4437   0.4601       0.0349
22   rs2027653      AFF    C    T          241/772/752   0.4374   0.4581      0.06132
22   rs2027653    UNAFF    C    T          248/813/746   0.4499    0.462        0.263
#output
22      rs2027653       ALL     C       T       0.136898/0.443729/0.419373     0.4437   0.4601  0.0349
22      rs2027653       AFF     C       T       0.136544/0.437394/0.426062     0.4374   0.4581  0.06132
22      rs2027653       UNAFF   C       T       0.137244/0.449917/0.412839     0.4499   0.462   0.263

Nice one! I had already written a small Python script to do it, but this is cleaner. I guess there is no direct way to retrieve them, then... Thanks!
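For comparison, here is a minimal Python sketch of the same conversion. It is not the script mentioned above; the function names `geno_frequencies` and `convert_hwe_line` are made up for illustration, and it assumes the GENO column is the sixth whitespace-separated field, as in the example rows.

```python
def geno_frequencies(geno):
    """Turn a 'hom1/het/hom2' count string into a tuple of frequencies."""
    counts = [int(c) for c in geno.split("/")]
    total = sum(counts)
    if total == 0:
        # No genotyped samples: a frequency is undefined.
        return tuple(float("nan") for _ in counts)
    return tuple(c / total for c in counts)


def convert_hwe_line(line):
    """Replace the GENO column (assumed 6th field) of one .hwe row with frequencies."""
    fields = line.split()
    freqs = geno_frequencies(fields[5])
    # Six significant digits, to match what awk prints by default.
    fields[5] = "/".join(f"{f:.6g}" for f in freqs)
    return "\t".join(fields)


# Row taken from the example input above.
row = "22   rs2027653      ALL    C    T        489/1585/1498   0.4437   0.4601       0.0349"
print(convert_hwe_line(row))
```

To process a whole file, one would skip the header line and call `convert_hwe_line` on each remaining row.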

Note that because of floating-point rounding when the values are printed (awk uses six significant digits by default), the three frequencies will not always sum to exactly 1 (i.e., 100%).

Yes, I noticed that. With Python it does, though.
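The rounding effect is easy to reproduce. The sketch below uses hypothetical genotype counts of 1/1/1 (not taken from the data above) and compares values rounded to six significant digits, as awk would print them, with exact rational arithmetic from Python's `fractions` module. Whether a particular Python script sums to exactly 1 depends on how it rounds; this only illustrates one way to guarantee it.

```python
from fractions import Fraction

counts = (1, 1, 1)  # hypothetical genotype counts, chosen to expose the rounding
total = sum(counts)

# Exact rational frequencies: the sum is exactly 1 by construction.
exact = [Fraction(c, total) for c in counts]

# Frequencies rounded to six significant digits, as in awk's default output.
printed = [float(f"{c / total:.6g}") for c in counts]

print(sum(exact) == 1)   # True
print(sum(printed))      # slightly below 1
```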

8.2 years ago

If I understand your question correctly, the information you're looking for is contained in the output of --hardy in vcftools.

Yes, but it's basically the same as what I did with plink. I was asking for a way to retrieve the genotypes and their frequencies directly, as vcftools does for the allelic frequencies. Thanks anyway! :)
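If no tool option fits, the observed genotype frequencies can also be counted straight from the GT fields of a VCF. The following is a rough sketch in plain Python, not a feature of plink or vcftools: the function name `genotype_frequencies` and the demo records are hypothetical, phased and unphased calls are pooled, and genotypes with a missing allele are skipped.

```python
from collections import Counter


def genotype_frequencies(vcf_lines):
    """Yield (chrom, pos, snp_id, {genotype: frequency}) for each VCF record."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the header, and blank lines
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # record carries no genotype calls
        gt_idx = fmt.index("GT")
        counts = Counter()
        for sample in fields[9:]:
            alleles = sample.split(":")[gt_idx].replace("|", "/").split("/")
            if "." in alleles:
                continue  # missing call
            # Sort alleles numerically so that 1/0 and 0/1 are pooled.
            counts["/".join(sorted(alleles, key=int))] += 1
        total = sum(counts.values())
        freqs = {g: n / total for g, n in counts.items()} if total else {}
        yield fields[0], fields[1], fields[2], freqs


# Hypothetical four-sample VCF, for illustration only.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "22\t100\trs1\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1\t1|0\t1/1",
]
for chrom, pos, snp_id, freqs in genotype_frequencies(demo):
    print(chrom, pos, snp_id, freqs)
```

For a real file, the same function can be fed an open file handle (for example `open("myfile.vcf")`, or `gzip.open(..., "rt")` for compressed input) instead of the `demo` list.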