I am trying to calculate haplotype blocks with the PLINK 1.9 --blocks command. My data are unphased, and I would like to know whether they need to be phased for the blocks to be estimated correctly this way. I know that the algorithm follows the block definition of Gabriel et al. (2002, Science) as implemented in Haploview, and that PLINK estimates the blocks following that procedure.
The Haploview method is based on D', and computing D' from its definition requires haplotype frequencies, which are compared with the product of the corresponding allele frequencies.
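To make my concern concrete, here is a minimal sketch of the two pieces as I understand them: D' computed from a haplotype frequency and two allele frequencies, and a textbook two-locus EM that estimates haplotype frequencies from unphased genotype counts, where only the double heterozygotes are ambiguous. The function names and the 3x3 count layout are my own assumptions for illustration, and I am not claiming this is what PLINK or Haploview do internally.

```python
def d_prime(p_ab, p_a, p_b):
    """|D'| for allele A (locus 1) and allele B (locus 2).

    p_ab is the frequency of the A-B haplotype; p_a and p_b are allele frequencies.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


def em_haplotype_freqs(n, iterations=100):
    """Estimate two-locus haplotype frequencies from unphased genotype counts.

    n[i][j] = number of individuals carrying i copies of A and j copies of B
    (hypothetical layout chosen for this sketch).
    """
    total = 2 * sum(sum(row) for row in n)  # number of chromosomes
    # Haplotypes that can be counted directly (every genotype except A/a B/b).
    known = {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
        "ab": 2 * n[0][0] + n[1][0] + n[0][1],
    }
    double_hets = n[1][1]  # phase unknown: AB/ab or Ab/aB
    p = {h: 0.25 for h in known}  # uninformative starting point
    for _ in range(iterations):
        # E-step: expected share of double heterozygotes that are AB/ab.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from known plus expected counts.
        p = {
            "AB": (known["AB"] + w * double_hets) / total,
            "ab": (known["ab"] + w * double_hets) / total,
            "Ab": (known["Ab"] + (1 - w) * double_hets) / total,
            "aB": (known["aB"] + (1 - w) * double_hets) / total,
        }
    return p


# Toy example: genotypes consistent with only AB and ab haplotypes.
counts = [[25, 0, 0], [0, 50, 0], [0, 0, 25]]
freqs = em_haplotype_freqs(counts)
p_a = freqs["AB"] + freqs["Ab"]
p_b = freqs["AB"] + freqs["aB"]
print(d_prime(freqs["AB"], p_a, p_b))
```

If something like this estimation step happens inside the tool, then phased input would not be strictly necessary, which is exactly what I am trying to confirm.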
However, the 2004 Haploview paper explicitly says:
Haploview accepts input in a variety of formats. Pedigree data can be loaded as either partially or fully phased chromosomes or as unphased diplotypes in the standard Linkage format
I also came across this paper, in which the PLINK authors explain the procedure in detail:
Briefly, the method involves using 90% confidence intervals for D' (as defined by Wall and Pritchard) to classify pairs of variants as “strong LD”, “strong evidence for historical recombination”, or “inconclusive”; then, contiguous groups of variants where “strong LD” pairs outnumber “recombination” pairs by more than 19 to 1 are greedily selected, starting with the longest base-pair spans.
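My reading of that description, as a rough sketch: each pair of variants gets a label from its D' confidence interval, and a candidate group counts as a block when "strong LD" pairs outnumber "recombination" pairs by more than 19 to 1. The thresholds below (0.70, 0.98, 0.90) are the ones I believe come from Gabriel et al., so treat them as an assumption; the function names are made up, the confidence intervals are taken as given, and the greedy selection by base-pair span is left out.

```python
def classify_pair(ci_low, ci_high, strong_low=0.70, strong_high=0.98, recomb_high=0.90):
    """Label one variant pair from the bounds of its D' confidence interval.

    The default thresholds are assumed, not taken from the quoted text.
    """
    if ci_low > strong_low and ci_high > strong_high:
        return "strong LD"
    if ci_high < recomb_high:
        return "recombination"
    return "inconclusive"


def is_block(labels, min_ratio=19):
    """True if 'strong LD' pairs outnumber 'recombination' pairs by more than min_ratio to 1."""
    strong = labels.count("strong LD")
    recomb = labels.count("recombination")
    return strong > min_ratio * recomb


# Hypothetical confidence intervals for the pairs inside one candidate group.
intervals = [(0.85, 0.99), (0.80, 1.00), (0.50, 0.95)]
labels = [classify_pair(low, high) for low, high in intervals]
print(labels, is_block(labels))
```

Inconclusive pairs do not enter the ratio in this sketch, which is how I read "outnumber".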
As far as I understand all this: am I safe in assuming that the --blocks option only estimates the blocks, and that my data are therefore not strictly required to be phased?