Linkage disequilibrium estimation in unphased data?
7 weeks ago

Hello there,

I've been tasked with LD-pruning a set of variants that my group is interested in. Our data is human WES, and it is not phased. I don't think it can be phased, either, since phasing requires reference panels, and the panels only work with WGS data (as far as I know). So, if my understanding of LD is correct, you simply can't calculate it with unphased data.

So, the next question is, then: can I do any better than a simple correlation test for every pair of variants in my region of interest? I could do Pearson's test, or Kendall's Tau, for example. The latter would make more sense if I'm using allele counts as input (Kendall's test is for ordinal variables).
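A minimal sketch of the allele-count correlation idea, for illustration only. It assumes each variant is a per-sample list of alternate-allele counts (0, 1 or 2, with `None` for a missing call); the function name `genotype_r2` is made up for this example. The squared Pearson correlation of allele counts is a commonly used r² estimate for unphased genotypes, though it is not the same quantity as a haplotype-based r².

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two allele-count vectors (0/1/2).

    Samples with a missing call (None) at either variant are skipped.
    Returns None if fewer than two samples remain or if either variant
    is monomorphic among the remaining samples.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    # Unnormalised covariance and variances; the 1/n factors cancel in r^2.
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs)
    var1 = sum((a - mean1) ** 2 for a, _ in pairs)
    var2 = sum((b - mean2) ** 2 for _, b in pairs)
    if var1 == 0 or var2 == 0:
        return None
    return cov * cov / (var1 * var2)


if __name__ == "__main__":
    variant_a = [0, 1, 2, 0, 1, 2, None]
    variant_b = [0, 1, 2, 0, 1, 2, 1]
    print(genotype_r2(variant_a, variant_b))
```

For pruning, one would typically drop one variant of any pair whose r² exceeds a chosen threshold; the threshold itself is a judgment call and is not specified here.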

Perhaps you can tell these things are a bit new to me. I might be saying or assuming dumb things, so don't hesitate to call me out ~ thanks in advance!

Joel

correlation WES Pearson LD Kendall • 594 views
7 weeks ago
dthorbur ★ 3.1k

Reference panels are not required for phasing. They're certainly nice to have though.

There are many tools you can use. For example, the recent SHAPEIT5 release paper phased exome data as part of its technical report, although they did use a reference panel there.

Without a reference panel, phasing WES data would only be able to tell you how SNPs segregate within a coding sequence. You have no coverage between coding sequences, so if you aren't using a reference panel you wouldn't be able to calculate LD accurately. But you should be fine if you have a relevant reference panel, and since you're using human data it seems like there are some to choose from.


Hello and thanks for replying!

In that paper, as well as in the SHAPEIT5 tutorials, the WES data is phased after having been combined with SNP array data. We don't have any such data. So we'd be limited to, as you say, within-exon phasing. Even then, we'd need reference panels, since exons can be longer than 75 bp, which is our read length (the data is a bit on the old side, sadly). Considering most of the variants we'd like to estimate LD for are in different exons, or exonic-intronic, or hundreds of bp apart, I don't think this can be done :(

Also, the reference panels used in that SHAPEIT5 paper were from the UKB, which is about half a million people. The freely available panels that can be used without having to upload sensitive patient data to overseas servers are only a couple of thousand samples large. I know for a fact that this makes a huge difference - we previously found a 5 Mbp shared haplotype only when using the HRC panel (almost as big as the UKB one).

Can we do any better than a correlation test? Perhaps we should just calculate the conditional probabilities? Like, P(var1|var2) and P(var2|var1), and if they're both > 95% or so, we call them linked, and only include one for the downstream...
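The conditional-probability idea could be sketched roughly like this. It is only an illustration under assumptions: each variant is a per-sample list of 0/1 carrier flags, the names `carrier_conditionals` and `call_linked` are hypothetical, and the 0.95 default simply mirrors the "> 95% or so" cutoff mentioned above.

```python
def carrier_conditionals(c1, c2):
    """Return (P(var1 | var2), P(var2 | var1)) from 0/1 carrier flags.

    An entry is None when the conditioning variant has no carriers,
    because the conditional frequency is then undefined.
    """
    both = sum(1 for a, b in zip(c1, c2) if a and b)
    n1 = sum(1 for a in c1 if a)
    n2 = sum(1 for b in c2 if b)
    p1_given_2 = both / n2 if n2 else None
    p2_given_1 = both / n1 if n1 else None
    return p1_given_2, p2_given_1


def call_linked(c1, c2, threshold=0.95):
    """True only if both conditional frequencies exceed the threshold."""
    p12, p21 = carrier_conditionals(c1, c2)
    if p12 is None or p21 is None:
        return False
    return p12 > threshold and p21 > threshold


if __name__ == "__main__":
    var1 = [1, 1, 0, 0, 1]
    var2 = [1, 1, 0, 0, 0]
    print(carrier_conditionals(var1, var2))
    print(call_linked(var1, var2))
```

With few carriers these frequencies are very noisy (a single carrier of each variant in the same sample already gives 100% both ways), so a minimum carrier count would probably be needed in practice.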


75bp reads are pretty useless for linkage disequilibrium even if you had a WGS dataset. You need reads overlapping 2+ segregating sites to be informative.

I don't work on human data, so I didn't realise there were data upload requirements to access panels. I had thought you'd be able to download them from places like 1000G or NCBI. But I agree about the population-specific information bias in reference data (I have a nice Mol. Ecol. Res. paper on this).

You could always generate your own variant panel, but this would be a considerable amount of work to make something robust.

I don't think a correlation test would be meaningful in this context. Your reads are too small to capture LD blocks, so no linkage inferences will be usable no matter how you try to approach it. If you had an extensive library of different insert lengths and super high coverage you might be able to use 75bp reads, but that's it IMO, and even then it would probably struggle to get past peer-review.


Hmm. Okay. So we'll give up on haplotyping ~ perhaps a permutation test would work? I think perhaps it's the best we can do, since without haplotypes we're essentially comparing two sorted 1D arrays (1 for mutation, 0 for not-mutation).
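A rough sketch of what such a permutation test might look like, under stated assumptions: each variant is a per-sample list of 0/1 carrier flags, the test statistic is the number of samples carrying both variants, and the test is one-sided (more co-occurrence than expected by chance). The names `permutation_test`, `n_perm` and `seed` are made up for this example.

```python
import random


def permutation_test(c1, c2, n_perm=10000, seed=0):
    """One-sided permutation p-value for co-occurrence of two variants.

    c1, c2: per-sample 0/1 carrier flags of equal length.
    The sample labels of c2 are shuffled n_perm times, and the p-value is
    the fraction of shuffles with at least as many double carriers as
    observed (with the usual +1 correction so it is never exactly 0).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = sum(1 for a, b in zip(c1, c2) if a and b)
    shuffled = list(c2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = sum(1 for a, b in zip(c1, shuffled) if a and b)
        if permuted >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    var1 = [1] * 10 + [0] * 10
    var2 = [1] * 10 + [0] * 10
    print(permutation_test(var1, var2, n_perm=2000))
```

A small p-value here only says the two variants co-occur in samples more often than chance; it does not say anything about phase, and related samples or population structure would violate the exchangeability the shuffle assumes.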

There are some small reference panels (e.g. 1KG, 2504 samples) available for download, but the good ones are only accessible by uploading one's data to servers/portals.
