Question

How To Perform A Multilocus (Snp Genotype Data) Linkage Disequilibrium Analysis Using R?

3

Entering edit mode

12.9 years ago

Francois Olivier Hébert ▴ 280

Hi,

I am working on a non-model species and I have a set of ~2300 genes in which I have identified multiple SNPs and I would like to perform a multilocus linkage disequilibrium analysis on my dataset. I've been looking on the web for a while now and I can't find anything that could help me with regards to the type of data that I've generated. I've read previous posts on the subject on this forum and it doesn't really help me.

I've explored a few possibilities so far:

Haploview (inappropriate input format, well suited for HapMap data, which I don't have)
GOLD (designed for human data)
TASSEL (alignment input files pretty hard to generate with my data)

And finally, I tried to use PyPop (http://www.pypop.org/), but I systematically get an error message when I try to run it and I can't reach anybody who could potentially help me, there is an invalid email address in the manual.

So, I'm still stuck with my problem : I would like to compute pairwise LD between my loci or have any sort of LD measure that could help me find associations between SNPs. It would be interesting to generate a heatmap to visualize the results and I know it is possible to do that using R.

In fact, I've been able to compute pairwise LD between loci using R with LD(), the only problem is that I don't really know how to apply it simultaneously for thousands of loci.

My starting point is this type of file that I've generated using custom Python scripts (I recently learned how to use Python and I'm getting better, but I'm still unexperienced with R, which is a shame since it's a very powerful tool in bioinformatics!):

Locus        sample_1        sample_2       sample_3       ...
Loc1         A/A             A/C            A/A            ...
Loc2         G/T             NA             G/T            ...
...

It's a TAB delimited genotype matrix file that goes on like that for ~2300 loci, with 24 samples and some missing data (specified by the symbol "NA"). From what I understand so far, if I want to use R, I have to create one genotype object per locus, using genotype():

> geno.loc1 <- c('A/A', 'A/C', 'A/A', ...)
> g1 <- genotype(geno.loc1, sep = "/")

I would have to do this for all the 2300 loci and then construct a data frame with all the genotype objects and compute pairwise LD:

> data <- makeGenotypes(data.frame(g1,g2,g3,..., g2300))
> ld.table <- LD(data)

g1, g2 and g3 being the genotype objects.

And then, with the matrix generated by LD() (or I think it's possible to just use the genotype data frame) I can use LDheatmap():

> LDheatmap(ld.table, ...)

I am totally not sure if this is the right direction or if there is something less complicated or more user friendly, but the point is... I don't know how to loop over all the loci in my input file to construct the genotype objects. And I guess I would have to loop over my genotype objects to create the genotype data frame. It's probably really easy to do, I'm just not sure how to do it and if it's the appropriate solution to my problem.

Is it a good way to do what I want to do? Is there a better solution?

Any help would be tremendously appreciated!

r snp • 12k views

ADD COMMENT • link updated 12.9 years ago by Michael 56k • written 12.9 years ago by Francois Olivier Hébert ▴ 280

Leonor Palmeira · Answer 1 · 2012-08-02

2

Entering edit mode

12.9 years ago

Michael 56k

Could you have a look at

(both are non-R tools (c++, and perl), but I would weight correct and efficient computation over language this time) The most exact way (see e.g. the PLINK manual) of computing LD is from haplotype frequencies via EM estimation, not from allele frequencies (I am not sure if LD() from package genetics does this, but it doesn't seem to easily extend to all vs all comparisons). INTERSNP can do that 'genome-wide' however and compute r2 values. Maybe you need to convert your data to fit one of the possible input formats. If you can get them into VCF format, that would be a good start because the plink formats (tped/tmap) can be derived from it.

Edit: Here is another post which shows how to do this using LDheatmap but as I a said, I am not sure if that is the state-of-the-art way of doing this.

ADD COMMENT • link updated 12.9 years ago by Leonor Palmeira 3.9k • written 12.9 years ago by Michael 56k

1

Entering edit mode

If you choose the plink option, it could be useful to check http://pngu.mgh.harvard.edu/~purcell/plink/rfunc.shtml and see how it can communicate with R

ADD REPLY • link 12.9 years ago by ff.cc.cc ★ 1.3k

0

Entering edit mode

Thanks for the help! Actually, the problem with the input files for these tools is that information on chromosome and genetic distance is required... which I don't have. I'm working on the American whitefish for which the genome isn't sequenced yet. In fact, there is no salmonid genome currently available and it seems very hard to work with tools that are made for genome wide data in this context. Nonetheless, I will try to create a .tped file with my data, but I'm worried that the results are going to be somewhat misguiding since I have no idea where the genes are in the gnome (on what chromosome, how far are they from each other, are they genetically or physically linked ?). In the end, I'm thinking that maybe with the type of data that I have, it might not be relevant to try to estimate the extent of LD among genes that aren't positioned on a genome, i.e with genotype information only. All I have is gene names (assembled genes from an exon sequence capture chip), SNP genotypes and SNP positions on the genes.

ADD REPLY • link 12.9 years ago by Francois Olivier Hébert ▴ 280

0

Entering edit mode

I understand the problem, and it might be worth double checking how LD is calculated by the tools. However, it seems a bit counterintuitive to assume that the distance between markers should go into the LD calculation; other than using a cutoff for the maximum distance between marker pairs to take into account (at least that is the case for plink i guess). You might also ask the authors of plink and intersnp directly. Instead, afaik, linkage analysis has been used to calculate genetic distance, so you could use the LD analysis for an approximate mapping of genes. But please try to validate these suggestions before you jump to any conclusion.

ADD REPLY • link 12.9 years ago by Michael 56k