Question

LD test yielding lots of R2 and D-prime of 1.

0

Entering edit mode

5.6 years ago

Pistachio ▴ 20

Hi,

I'm currently analyzing some bee genomes (honey bees, bumble bees, wasps) and doing some linkage disequilibrium tests on the SNPs. The method I'm using is Plink 1.9 --r2 with mostly default parameters.

All three data have been yielding lots of r2 and dprime with values of 1, almost regardless of the distance between the SNPs.

enter image description here

Bumble bee is entirely comprised of drones, which are haploids so I initially thought that was the issue. However, boney bee and wasp data includes worker genome, so they should be diploids.

Now I'm suspecting something else to be the problem, but can't guess what that could be.

Setting minor allele frequency cut off to 0.3 removes some chunk of the r2 of 1s, but also removes some other points as well. Also tried using vcftools hap-r2 or geno-r2, but those run for hours and never finish.

Would appreciate any helps and pointers and thank you in advance.

genome • 1.3k views

ADD COMMENT • link 5.6 years ago by Pistachio ▴ 20

0

Entering edit mode

Problem solved. Switched to vcftools (not sure why my data didn't work well in Plink, but might be that it was haploid), which still showed issue, but in a different way. Then averaged the LD values for every given distances and the graph came out clean.

ADD REPLY • link 5.3 years ago by Pistachio ▴ 20

0

Entering edit mode

Ooh! how did you average LD values for given distances?

To calculate LD across ~220 individuals in ~10 populations, I used vcftools --r2 and filtered for R.2 in SNPs of the top 100 longest scaffolds of my assembly. It seems like the mean values are still wacky... or I haven't found out an effective way to group the distances.

Here's the raw output from vcftools

> head(ld_top100scaffs)
  CHR  POS1  POS2 N_INDV        R.2
1   1 39963 39978    211 0.01736520
2   1 39963 40007    209 0.09148530
3   1 39963 40016    211 1.00000000
4   1 39963 40041    211 1.00000000
5   1 39963 40080    211 0.00684638
6   1 39963 40084    211 0.76684000

mean_lddat <- ld_top100scaffs %>%
  mutate(mean_pos = ((POS2 + POS1) /2) / 1e+06) %>%
  group_by(mean_pos)  %>%
  mutate(mean_R.2 = mean(R.2))

Looks like this (pardon the external link, couldn't get the embedded images to work)

LD

Still, if I try to summarize LD by distance class

dist.class <- cut(lddat$mean_pos_Mb, seq(from= min(lddat$mean_pos_Mb), to = max(lddat$mean_pos_Mb), by = .1), include.lowest = TRUE)
mean.R.2 <- tapply(lddat$R.2, dist.class, mean, na.rm = TRUE)

LD by distance class

There's still no decay in LD even at 8Mb...

Any ideas or suggestions would be appreciated! Thanks!

ADD REPLY • link 5.3 years ago by amandastahlke • 0