Question

LD, LD-decay and Correlation Coefficients of SNP loci

0

Entering edit mode

6.8 years ago

mysisayk ▴ 20

Dear all, I am working on GWAS of yellow rust resistance in durum wheat. I have ~ 9k SNPs genotype data generated from 293 individuals. The data is converted in to numerical format with 1 and -1 for the most and less frequent homozygote genotypes respectively (AA, CC, GG, &TT), -9 for the missing values and no 0's as we used single seed descent lines for the genotyping and heterozygotes are not expected. It also includes the allele frequency values. I want to calculate the LD, LD-decay and Correlation Coefficients for this data preferably in R. I could not address this problem with my little experience and I need help. I tried to calculate the Correlation coefficient between all possible combination of loci taking each two at a time in Excel. But this is so cumbersome and unthinkable to finish it this way. Besides, I am not sure if I can use the numerical data (mentioned above) for all the calculations. Can any one help me in suggesting a clue on how to get this done?

Thanks before hand! Sisay

R SNP • 2.5k views

ADD COMMENT • link updated 6.8 years ago by aays ▴ 180 • written 6.8 years ago by mysisayk ▴ 20

score 0 · Answer 1 · 2017-07-23

Do you happen to have your data in VCF format prior to the format conversion? If so, PLINK 1.9 has been an excellent all-in-one tool for GWAS that I've found success using.

Otherwise, I'd think allele frequencies are the way to go in calculating pairwise LD. I'm not sure I fully understand the specifics of the format your data is in, but the way I'd tackle the problem would be:

a) Find a way to read in your data that preserves the frequencies needed for each site. If we're to stick to the idea of allele frequencies, perhaps their final iteration could be a list of vectors of size 4 (I'm assuming biallelic SNPs) containing chromosome number + site + frequencies of the major and minor allele (i.e. c('chr1', 24, 0.58, 0.42) for position 24 of chr1 with a major allele freq of 0.58 and a MAF of 0.42). Or perhaps a data frame with each row representing a site and each column representing one of those frequency values? I'm not particularly an expert at what the most efficient method might be here, but as long as you figure out something along those lines.

b) Once you've settled on a format, create another function that would calculate parameters needed for LD calculations. The intricacies of this obviously boil down to how your data is organized, but a simple way of conceptualizing D, for instance, might be

dcalc <- function(p11, p22, p12, p21) {
    LHS <- p11 * p22
    RHS <- p12 * p21
    out <- LHS - RHS
    return(out)
}

Which would itself necessitate a function to actually parse your data and calculate the parameters I've denoted as p11 etc. This function could take in pairs of records and return said parameters for use in the actual pairwise LD calculation functions.

c) Once these are finalized, load in your data and use nested loops to perform pairwise comparisons. This is ghoulish amounts of slow, I know, and especially so in the loop-averse world of R, but I'm trying to stick to the most accessible way to do this. Let's say you have your data in two separate lists denoting two regions you want to calculate pairwise LD between. Here's an extremely rough idea of what that might look like:

for (a in 1:length(region1)) {
    for (b in 1:length(region2)) {
        freqs <- freqscalc(region1[a], region2[b]) # assuming this fxn calculates p11 etc
        print(dcalc(freqs[1], freqs[2], freqs[3], freqs[4]))
    }
}

Obviously you wouldn't want to use print here in something like RStudio at the risk of polluting your console, so it might be worthwhile to make an executable R script out of this and run it in the shell in order to use Unix redirects for saving your output to a file.

Something along these lines can be used to calculate r2 and D' as well - all it necessitates is actually coding up the functions.

Finally, LD decay could be calculated by having your functions also return the distance between sites they're calculating LD for, and outputting that in addition to the actual pairwise LD statistic between those sites. These values could then be saved to a data frame object, for instance, with each row containing a) a value for distance and b) the amount of LD. Once you've obtained that for all desired comparisons, you could use that data frame to simply plot LD ~ distance.

Hope this helps!