Question: linkage disequilibrium analysis
0
gravatar for tarek.mohamed
6 months ago by
tarek.mohamed130
tarek.mohamed130 wrote:

Hi All,

Using 1000 genomes database, I have downloaded genotype data for 99 individuals for couple of thousands of SNPs distributed across different chromosomes, I have this data in one vcf file. I want to perform linkage disequilibrium analysis between all of these SNPs, I need the r2 and the d' values as well.

What tool you recommend for such analysis?

Thanks Tarek

ld snps different chromosomes • 565 views
ADD COMMENTlink modified 6 months ago by Kevin Blighe19k • written 6 months ago by tarek.mohamed130
1
gravatar for Kevin Blighe
6 months ago by
Kevin Blighe19k
University College London Cancer Institute
Kevin Blighe19k wrote:

Short way (quick)

If you have a VCF already, you can just use VCFtools in order to do a very simple linkage disequilibrium (LD) analysis: http://vcftools.sourceforge.net/documentation.html#ld


Long way (more flexibility and comprehensive)

Another, more roundabout approach would be to get your data from VCF to PLINK format, where you could do a more comprehensive analysis. You could have followed my tutorial (Produce PCA bi-plot for 1000 Genomes Phase III in VCF format ), which includes the downloading of all 1000 Genomes Phase III data in VCF format and then converting them into PLINK format.

Here is further information for conducting LD analysis in PLINK:

If you follow my tutorial, you'll have the entire 1000 Genomes Phase III samples in PLINK, and from there you can easily filter in/out your samples of interest. See here for details: https://www.cog-genomics.org/plink/1.9/filter

For using a dataset correctly in PLINK, you should create a custom FAM file that matches your dataset and then specify this when performing LD analysis with --fam MyCustom.fam. A FAM file contains 7 columns:

  1. Family ID (FID)
  2. Individual ID (IID)
  3. Paternal ID (PID)
  4. Maternal ID (MID)
  5. Gender (1, male; 2, female)
  6. Phenotype/Disease status (1, control; 2, case/disease)

The file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped, which you'll get if you follow my tutorial, already contains this information, so, use that and filter out what you don't need.

Also, when reading your data from VCF/BCF into PLINK, it is critical that you specify a sample order file so that PLINK reads the samples in the order that you want and in the order that matches the samples as listed in your custom FAM. A sample command is:

plink --noweb --bcf My.bcf --keep-allele-order --indiv-sort file SampleSort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out PlinkDataForLD

The file mentioned in this command after the --indiv-sort file command-line parameter, SampleSort.list, contains 2 columns, like this:

0  NA0165
0  NA0169
et cetera

Then, to do LD analysis in PLINK:

plink --file PlinkDataForLD --r2 --ld-window-kb 1000 --ld-window 100000 --ld-window-r2 0 --fam MyCustom.fam
ADD COMMENTlink modified 6 months ago • written 6 months ago by Kevin Blighe19k

Hi Kevin,

Thanks for your reply. Actually, I tried vcftools, but I got negative values for r2 which of course does not make sense! I am going try the long way approach, and I will let you know the updates.

Thanks, T

ADD REPLYlink written 6 months ago by tarek.mohamed130

Hi Tarek,

Okay, on reflection, you may not require the complex part of creating the custom FAM, considering that all of your samples will be 'healthy' 1000 Genomes samples. The LD analysis will just look at all samples in the dataset and not use information on phenotype, gender, etc.

In that case, you possibly just need to do this:

plink --noweb --vcf My.vcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out PlinkDataForLD

plink --file PlinkDataForLD --r2 --ld-window-kb 1000 --ld-window 100000 --ld-window-r2 0

PLINK is a very good and comprehensive analysis tool, though.

Respond here if you need help or want me to look at anything.

Kevin

ADD REPLYlink modified 6 months ago • written 6 months ago by Kevin Blighe19k

Hi Kevin , I was able to do the analysis using a combination of bedtools, vcftools, and plink. I had 9000 SNPs distributed across 4 different genes which I did LD analysis for. Now I have LD analysis result (plink.ld) file. I want to view the analysis graphically, which tool do you recommend?

[tmm447@quser11 g1k_data_files]$ head plink.ld

CHR_A         BP_A         SNP_A  CHR_B         BP_B         SNP_B           R2           DP 
 6    160543123     rs1867351      6    160543148    rs12208357    0.0215633            1 
 6    160543123     rs1867351      6    160543229    rs55918055   0.00124509            1 
 6    160543123     rs1867351      6    160543562      rs461473    0.0291014            1 
 6    160543123     rs1867351      6    160543610     rs4709400            1            1 
 6    160543123     rs1867351      6    160543934    rs62440864     0.815385            1 
 6    160543123     rs1867351      6    160544680   rs192748353   0.00124509            1 
 6    160543123     rs1867351      6    160544961    rs73025537    0.0201052            1 
 6    160543123     rs1867351      6    160545046   rs111909590   0.00635448            1 
 6    160543123     rs1867351      6    160545394      rs463599    0.0403996            1

Thnaks Tarek

ADD REPLYlink modified 5 months ago • written 5 months ago by tarek.mohamed130

Sorry, if I have bed/bim/fam files from 16SrRNAs and a phenotypes file like screenshot, does that mean that I should correlate these files with phenotypes files as OTU?

Yes, this an OTU, I should first convert that to PLINK format.

ADD REPLYlink modified 6 months ago • written 6 months ago by F3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1579 users visited in the last hour