How to compare LD of a gene for all subpopulations in 1000 Genome?
1
0
Entering edit mode
11 weeks ago
Rashmi ▴ 20

Hi,

I have a list of genes, for which I need to compare LD plots of the gene across all subpopulations in 1000 Genome. I have written a perl script to do so but have run into a few challenges.

My script does the following, given a gene name:

1. extract the list of variants (maf 0.05) present in 1000G data from the VCF file
2. For each subpopulation in 1000G, pairwise LD is calculated for all variants using PLINK
3. Plot the LD for all subpopulations in one PDF file using R.

However, I run into a couple of challenges, as I am new to population genetics.

1. From the image below, you can see the list of variants is not the same for all subpopulations. So can I just take the common subset of variants so that I can compare the LD among them?

2) For some genes, the number of variants are too many (>100, sometimes 200-300) and hence the LD plot does not appear or is uninformative (see below). How can I subset the list of variants WITHOUT LOSING LD structure? (NOTE: --indep option in PLINK is not suitable for me, I am NOT looking for independent SNPS)

LDPlot SNPs variants 1000Genome • 467 views
0
Entering edit mode
11 weeks ago

Hi Rashmi,

From the image below, you can see the list of variants is not the same for all subpopulations. So can I just take the common subset of variants so that I can compare the LD among them?

It is difficult to say... I would be interested to observe the variants that are different and to understand at which stage these are being filtered out. How are you applying the MAF filter? - globally, or per population?

2) For some genes, the number of variants are too many (>100, sometimes 200-300) and hence the LD plot does not appear or is uninformative (see below). How can I subset the list of variants WITHOUT LOSING LD structure? (NOTE: --indep option in PLINK is not suitable for me, I am NOT looking for independent SNPS)

I have not encountered this problem. Usually, more variants == 'better' result. Also, I am unsure what you mean about the --indep option. This option calculates variants in disequilibrium via VIF: https://www.cog-genomics.org/plink/1.9/ld

Can you show the code that you are using for all steps? MAF 0.05 is too low, in my opinion - I would use 0.1 - 0.2

0
Entering edit mode

It is difficult to say... I would be interested to observe the variants that are different and to understand at which stage these are being filtered out.

I am new to population genetics, but what I understand so far is that to compare LD between 2 population, I should use the same set of variants. Is that not so? My aim is to see if there is a difference in gene structure among the different populations.

How are you applying the MAF filter? - globally, or per population?

The MAF filter is applied globally - I filtered the VCF file for each chromosome using vcftools

vcftools --gzvcf ALL.chr13.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
--maf 0.05  --max-alleles 2 --recode --remove-indels --stdout


I have not encountered this problem. Usually, more variants == 'better' result.

Maybe I should increase my canvas size and the plots might display. I will try that.

Also, I am unsure what you mean about the --indep option

In the plink forum, when I asked this question, I was adviced to use this option. But my understanding of this option is that it will give a list of variants that are independent and NOT in LD with each other, which is opposite of what I am looking for.

Can you show the code that you are using for all steps?

# extract gene region for each sub population from VCF file
vcf_to_ped_convert.pl -vcf chr12_maf0.05.vcf.gz -sample_panel_file ALL.panel -region 12:112204691−112247782 -population GWD -output_ped gene1.ped -output_info gene1.info
# convert info file into map file
info_to_map.pl gene1.info > gene1.map
# calculate LD
plink1.9 --file gene1 -r2 square --out gene1 --biallelic-only --snps-only
# plot in R
library(gaston)
LD.plot(gene1_matrix)

1
Entering edit mode

In the plink forum, when I asked this question, I was adviced to use this option. But my understanding of this option is that it will give a list of variants that are independent and NOT in LD with each other, which is opposite of what I am looking for.

Yes, it identifies a set of variants that are in 'approximate' linkage equilibrium, and this is achieved by identifying and filtering out other variants that are in disequilibrium, based on a threshold, like r-squared or VIF.

I have not used gaston, unfortunately.

For haploblock analysis, you don't have to filter out any SNPs based on LD; however, you could still apply the Global MAF filter, if you wish. This being said, the haploblocks are very much defined based on r-squared or some other metric.

What you are doing we did previously for comparing 1000 Genomes populations, i.e., comparing their haplo-blocks and -types across a specific gene. If you can take a look at my part 3, here: Manhattan plots and linkage disequilibrium heatmap, and also here for a more complete walkthrough: How do I compute ld blocks from the hapmap ld_data?

The actual comparisons were then just visual, e.g., "this 'haplotype' was observed at higher frequency in South-East Asians when compared to Europeans".

1
Entering edit mode

Thank you so much for your insights! I have a clearer picture of what is to be done now. I had read the part 3 of your tutorial earlier but understood its significance to my problem only now. The ld blocks tutorial makes it clearer.