Inspect the LD relation between a single pair of SNPs in detail using Plink --ld
1
0
Entering edit mode
4.4 years ago
Tao ▴ 530

When I calculate the LD details between a pair of SNPs using following command as an example, why there might be two solutions shown as following (Q1)?

plink --bfile 1000G.genotypes --ld rs10005588 rs12505231

>

Solution #1:
   R-sq = 0.000943758    D' = 1

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0                       0.000888
          AA      0.030815                0.029927
          GC      0.028827                0.027939
          AC      0.940358                0.941246

   In phase alleles are GC/AA

Solution #2:
   R-sq = 0.426699       D' = 0.676064

   Haplotype     Frequency    Expectation under LE
   ---------     ---------    --------------------
          GA      0.019777                0.000888
          AA      0.011038                0.029927
          GC      0.009050                0.027939
          AC      0.960135                0.941246

   In phase alleles are GA/AC

Q2: As described in PLINK, the "Frequency" column represents "observed" frequencies of each haplotype, so why the "Frequencies" in the two solutions are different?

"To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics. by PLINK1.9"

plink --ld LD • 1.3k views
ADD COMMENT
1
Entering edit mode
4.4 years ago

Q1. Humans have two distinct copies of every autosome (setting aside trisomy 21, etc. for now).

Suppose someone is heterozygous for both variants: they have both an A and a G at rs10005588, and they have both an A and a C at rs12505231. Then there are two haplotype possibilities:

  • One copy of chromosome 4 has both As, and the other copy has a G at rs10005588 and a C at rs12505231.
  • One copy of chromosome 4 has an A at rs10005588 and a C at rs12505231, and the other copy has a G at rs10005588 and an A at rs12505231.

Many analyses do not care about the difference between these two possibilities--they can ignore 'phase'. However, the --ld computation actually needs a separate frequency estimate for each haplotype. This is where the CubeX reference (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-428 ) comes into play: the solutions of that cubic equation are the local maxima of a commonly used haplotype-frequency likelihood function. PLINK 1.9 --ld reports all the local maxima.

With that said, you're probably most interested in the global maximum; in retrospect, PLINK 1.9 --ld should have labeled that, too. PLINK 2.0 --ld corrects this oversight.

Q2. As mentioned above, PLINK usually doesn't know the actual haplotypes for samples which are heterozygous for both variants. (Exception: you are using PLINK 2 .pgen files constructed from a phased VCF. Note that the Eagle (https://data.broadinstitute.org/alkesgroup/Eagle/ ), SHAPEIT4 (https://odelaneau.github.io/shapeit4/ ), and Beagle (https://faculty.washington.edu/browning/beagle/beagle.html ) software packages can produce pretty good phased VCFs when you don't have any phase information from the original genotyping/sequencing process.) The frequency column displays the haplotype-frequency guesses that PLINK is using.

ADD COMMENT
0
Entering edit mode

Thanks for your comments. But I am not quite understand that why plink doesn't directly count the real haplotypes from samples? since the 1000G genotypes are phased.

ADD REPLY
1
Entering edit mode

PLINK 2.0 —ld does count the real haplotypes.

PLINK 1.x is incapable of that because its core file format can’t represent phase.

Since your "1000G.genotypes" fileset is in PLINK 1.x's format (.bed), even running PLINK 2.0 --ld on it won't work; the phase information was lost during the original conversion to .bed. Use the files at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 with PLINK 2.0 instead.

ADD REPLY
0
Entering edit mode

Thanks for your prompt and deep insights!

ADD REPLY

Login before adding your answer.

Traffic: 1487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6