Question: Inspect the LD relation between a single pair of SNPs in detail using Plink --ld
0
gravatar for Tao
12 months ago by
Tao420
Tao420 wrote:

When I calculate the LD details between a pair of SNPs using following command as an example, why there might be two solutions shown as following (Q1)?

plink --bfile 1000G.genotypes --ld rs10005588 rs12505231

>

Solution #1:
   R-sq = 0.000943758    D' = 1

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0                       0.000888
          AA      0.030815                0.029927
          GC      0.028827                0.027939
          AC      0.940358                0.941246

   In phase alleles are GC/AA

Solution #2:
   R-sq = 0.426699       D' = 0.676064

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0.019777                0.000888
          AA      0.011038                0.029927
          GC      0.009050                0.027939
          AC      0.960135                0.941246

   In phase alleles are GA/AC

Q2: As described in PLINK, the "Frequency" column represents "observed" frequencies of each haplotype, so why the "Frequencies" in the two solutions are different?

"To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics. by PLINK1.9"

ld plink --ld • 326 views
ADD COMMENTlink modified 12 months ago by chrchang5237.4k • written 12 months ago by Tao420
1
gravatar for chrchang523
12 months ago by
chrchang5237.4k
United States
chrchang5237.4k wrote:

Q1. Humans have two distinct copies of every autosome (setting aside trisomy 21, etc. for now).

Suppose someone is heterozygous for both variants: they have both an A and a G at rs10005588, and they have both an A and a C at rs12505231. Then there are two haplotype possibilities:

  • One copy of chromosome 4 has both As, and the other copy has a G at rs10005588 and a C at rs12505231.
  • One copy of chromosome 4 has an A at rs10005588 and a C at rs12505231, and the other copy has a G at rs10005588 and an A at rs12505231.

Many analyses do not care about the difference between these two possibilities--they can ignore 'phase'. However, the --ld computation actually needs a separate frequency estimate for each haplotype. This is where the CubeX reference (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-428 ) comes into play: the solutions of that cubic equation are the local maxima of a commonly used haplotype-frequency likelihood function. PLINK 1.9 --ld reports all the local maxima.

With that said, you're probably most interested in the global maximum; in retrospect, PLINK 1.9 --ld should have labeled that, too. PLINK 2.0 --ld corrects this oversight.

Q2. As mentioned above, PLINK usually doesn't know the actual haplotypes for samples which are heterozygous for both variants. (Exception: you are using PLINK 2 .pgen files constructed from a phased VCF. Note that the Eagle (https://data.broadinstitute.org/alkesgroup/Eagle/ ), SHAPEIT4 (https://odelaneau.github.io/shapeit4/ ), and Beagle (https://faculty.washington.edu/browning/beagle/beagle.html ) software packages can produce pretty good phased VCFs when you don't have any phase information from the original genotyping/sequencing process.) The frequency column displays the haplotype-frequency guesses that PLINK is using.

ADD COMMENTlink modified 12 months ago • written 12 months ago by chrchang5237.4k

Thanks for your comments. But I am not quite understand that why plink doesn't directly count the real haplotypes from samples? since the 1000G genotypes are phased.

ADD REPLYlink written 12 months ago by Tao420
1

PLINK 2.0 —ld does count the real haplotypes.

PLINK 1.x is incapable of that because its core file format can’t represent phase.

Since your "1000G.genotypes" fileset is in PLINK 1.x's format (.bed), even running PLINK 2.0 --ld on it won't work; the phase information was lost during the original conversion to .bed. Use the files at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 with PLINK 2.0 instead.

ADD REPLYlink modified 12 months ago • written 12 months ago by chrchang5237.4k

Thanks for your prompt and deep insights!

ADD REPLYlink written 12 months ago by Tao420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1501 users visited in the last hour