Hi,
I am working on a non-model species and I have a set of ~2300 genes in which I have identified multiple SNPs and I would like to perform a multilocus linkage disequilibrium analysis on my dataset. I've been looking on the web for a while now and I can't find anything that could help me with regards to the type of data that I've generated. I've read previous posts on the subject on this forum and it doesn't really help me.
I've explored a few possibilities so far:
- Haploview (inappropriate input format, well suited for HapMap data, which I don't have)
- GOLD (designed for human data)
- TASSEL (alignment input files pretty hard to generate with my data)
And finally, I tried to use PyPop (http://www.pypop.org/), but I systematically get an error message when I try to run it and I can't reach anybody who could potentially help me, there is an invalid email address in the manual.
So, I'm still stuck with my problem : I would like to compute pairwise LD between my loci or have any sort of LD measure that could help me find associations between SNPs. It would be interesting to generate a heatmap to visualize the results and I know it is possible to do that using R.
In fact, I've been able to compute pairwise LD between loci using R with LD()
, the only problem is that I don't really know how to apply it simultaneously for thousands of loci.
My starting point is this type of file that I've generated using custom Python scripts (I recently learned how to use Python and I'm getting better, but I'm still unexperienced with R, which is a shame since it's a very powerful tool in bioinformatics!):
Locus sample_1 sample_2 sample_3 ...
Loc1 A/A A/C A/A ...
Loc2 G/T NA G/T ...
...
It's a TAB delimited genotype matrix file that goes on like that for ~2300 loci, with 24 samples and some missing data (specified by the symbol "NA"). From what I understand so far, if I want to use R, I have to create one genotype object per locus, using genotype()
:
> geno.loc1 <- c('A/A', 'A/C', 'A/A', ...)
> g1 <- genotype(geno.loc1, sep = "/")
I would have to do this for all the 2300 loci and then construct a data frame with all the genotype objects and compute pairwise LD:
> data <- makeGenotypes(data.frame(g1,g2,g3,..., g2300))
> ld.table <- LD(data)
g1, g2 and g3 being the genotype objects.
And then, with the matrix generated by LD()
(or I think it's possible to just use the genotype data frame) I can use LDheatmap()
:
> LDheatmap(ld.table, ...)
I am totally not sure if this is the right direction or if there is something less complicated or more user friendly, but the point is... I don't know how to loop over all the loci in my input file to construct the genotype objects. And I guess I would have to loop over my genotype objects to create the genotype data frame. It's probably really easy to do, I'm just not sure how to do it and if it's the appropriate solution to my problem.
Is it a good way to do what I want to do? Is there a better solution?
Any help would be tremendously appreciated!