I have a list of genes, and for each gene I need to compare LD plots across all subpopulations in the 1000 Genomes Project. I have written a Perl script to do this, but have run into a few challenges.
My script does the following, given a gene name:
- Extract the list of variants (MAF threshold 0.05) present in the 1000G data from the VCF file
- For each subpopulation in 1000G, calculate pairwise LD for all variants using PLINK
- Plot the LD for all subpopulations in one PDF file using R.
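For reference, the per-subpopulation PLINK step above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl script itself. The file names, the population code, and the helper `build_plink_ld_cmd` are hypothetical. It assumes PLINK 1.9, a keep file listing the samples of one subpopulation (FID and IID columns), and a variant list produced by the extraction step.

```python
from typing import List


def build_plink_ld_cmd(vcf_path: str, keep_file: str,
                       variant_list: str, out_prefix: str) -> List[str]:
    """Build a PLINK 1.9 command that computes pairwise r^2 for one subpopulation.

    All paths are placeholders (assumptions); the command is only assembled
    here, not executed.
    """
    return [
        "plink",
        "--vcf", vcf_path,          # gene-region VCF
        "--keep", keep_file,        # samples belonging to one subpopulation
        "--extract", variant_list,  # variant IDs selected in the first step
        "--r2", "square",           # full pairwise r^2 matrix
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names for one subpopulation (GBR).
    cmd = build_plink_ld_cmd("gene.vcf.gz", "GBR.keep", "gene.snplist", "ld_GBR")
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Repeating this once per subpopulation keep file gives one r^2 matrix per subpopulation, which can then be passed to R for plotting.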
As I am new to population genetics, I have two questions:
1) As the image below shows, the list of variants is not the same for all subpopulations. Can I just take the common subset of variants so that I can compare LD among them?
2) For some genes there are too many variants (>100, sometimes 200-300), so the LD plot either fails to render or is uninformative (see below). How can I subset the list of variants WITHOUT LOSING LD structure? (NOTE: the --indep option in PLINK is not suitable for me; I am NOT looking for independent SNPs.)
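To make the first question concrete: if the common-subset approach is taken, it amounts to intersecting the per-subpopulation variant lists. A minimal sketch in Python, assuming the variant IDs for each subpopulation have already been read into lists (the helper name `common_variants` and the example IDs are made up for illustration):

```python
from typing import Dict, List


def common_variants(variants_by_pop: Dict[str, List[str]]) -> List[str]:
    """Return the variant IDs present in every subpopulation.

    The order follows the first subpopulation's list, so the positional
    order of variants is kept if that list was sorted by position.
    """
    if not variants_by_pop:
        return []
    pops = list(variants_by_pop)
    shared = set(variants_by_pop[pops[0]])
    for pop in pops[1:]:
        shared &= set(variants_by_pop[pop])
    return [v for v in variants_by_pop[pops[0]] if v in shared]


if __name__ == "__main__":
    # Made-up variant IDs, for illustration only.
    example = {
        "GBR": ["rs1", "rs2", "rs3"],
        "YRI": ["rs2", "rs3", "rs4"],
        "CHB": ["rs3", "rs2"],
    }
    print(common_variants(example))
```

The resulting list could be written to a file and passed to PLINK's --extract so that every subpopulation's LD matrix covers the same variants. Whether dropping the population-specific variants is appropriate is exactly what I am unsure about.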