Question: Inspect the LD relation between a single pair of SNPs in detail using Plink --ld
0
12 months ago by
Tao420
Tao420 wrote:

When I calculate the LD details between a pair of SNPs using following command as an example, why there might be two solutions shown as following (Q1)?

``````plink --bfile 1000G.genotypes --ld rs10005588 rs12505231
``````

>

``````Solution #1:
R-sq = 0.000943758    D' = 1

Haplotype     Frequency    Expectation under LE
---------     ---------    --------------------
GA      0                       0.000888
AA      0.030815                0.029927
GC      0.028827                0.027939
AC      0.940358                0.941246

In phase alleles are GC/AA

Solution #2:
R-sq = 0.426699       D' = 0.676064

Haplotype     Frequency    Expectation under LE
---------     ---------    --------------------
GA      0.019777                0.000888
AA      0.011038                0.029927
GC      0.009050                0.027939
AC      0.960135                0.941246

In phase alleles are GA/AC
``````

Q2: As described in PLINK, the "Frequency" column represents "observed" frequencies of each haplotype, so why the "Frequencies" in the two solutions are different?

"To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics. by PLINK1.9"

ld plink --ld • 326 views
modified 12 months ago by chrchang5237.4k • written 12 months ago by Tao420
1
12 months ago by
chrchang5237.4k
United States
chrchang5237.4k wrote:

Q1. Humans have two distinct copies of every autosome (setting aside trisomy 21, etc. for now).

Suppose someone is heterozygous for both variants: they have both an A and a G at rs10005588, and they have both an A and a C at rs12505231. Then there are two haplotype possibilities:

• One copy of chromosome 4 has both As, and the other copy has a G at rs10005588 and a C at rs12505231.
• One copy of chromosome 4 has an A at rs10005588 and a C at rs12505231, and the other copy has a G at rs10005588 and an A at rs12505231.

Many analyses do not care about the difference between these two possibilities--they can ignore 'phase'. However, the --ld computation actually needs a separate frequency estimate for each haplotype. This is where the CubeX reference (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-428 ) comes into play: the solutions of that cubic equation are the local maxima of a commonly used haplotype-frequency likelihood function. PLINK 1.9 --ld reports all the local maxima.

With that said, you're probably most interested in the global maximum; in retrospect, PLINK 1.9 --ld should have labeled that, too. PLINK 2.0 --ld corrects this oversight.

Q2. As mentioned above, PLINK usually doesn't know the actual haplotypes for samples which are heterozygous for both variants. (Exception: you are using PLINK 2 .pgen files constructed from a phased VCF. Note that the Eagle (https://data.broadinstitute.org/alkesgroup/Eagle/ ), SHAPEIT4 (https://odelaneau.github.io/shapeit4/ ), and Beagle (https://faculty.washington.edu/browning/beagle/beagle.html ) software packages can produce pretty good phased VCFs when you don't have any phase information from the original genotyping/sequencing process.) The frequency column displays the haplotype-frequency guesses that PLINK is using.

Thanks for your comments. But I am not quite understand that why plink doesn't directly count the real haplotypes from samples? since the 1000G genotypes are phased.

1

PLINK 2.0 —ld does count the real haplotypes.

PLINK 1.x is incapable of that because its core file format can’t represent phase.

Since your "1000G.genotypes" fileset is in PLINK 1.x's format (.bed), even running PLINK 2.0 --ld on it won't work; the phase information was lost during the original conversion to .bed. Use the files at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 with PLINK 2.0 instead.