1
2
Entering edit mode
3.9 years ago
tarek.mohamed ▴ 300

Hi All,

Using 1000 genomes database, I have downloaded genotype data for 99 individuals for couple of thousands of SNPs distributed across different chromosomes, I have this data in one vcf file. I want to perform linkage disequilibrium analysis between all of these SNPs, I need the r2 and the d' values as well.

What tool you recommend for such analysis?

Thanks Tarek

LD SNPs different chromosomes • 4.8k views
2
Entering edit mode
3.9 years ago

# Short way (quick)

If you have a VCF already, you can just use VCFtools in order to do a very simple linkage disequilibrium (LD) analysis: http://vcftools.sourceforge.net/documentation.html#ld

# Long way (more flexibility and comprehensive)

Another, more roundabout approach would be to get your data from VCF to PLINK format, where you could do a more comprehensive analysis. You could have followed my tutorial (Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old) ), which includes the downloading of all 1000 Genomes Phase III data in VCF format and then converting them into PLINK format.

Here is further information for conducting LD analysis in PLINK:

If you follow my tutorial, you'll have the entire 1000 Genomes Phase III samples in PLINK, and from there you can easily filter in/out your samples of interest. See here for details: https://www.cog-genomics.org/plink/1.9/filter

For using a dataset correctly in PLINK, you should create a custom FAM file that matches your dataset and then specify this when performing LD analysis with --fam MyCustom.fam. A FAM file contains 7 columns:

1. Family ID (FID)
2. Individual ID (IID)
3. Paternal ID (PID)
4. Maternal ID (MID)
5. Gender (1, male; 2, female)
6. Phenotype/Disease status (1, control; 2, case/disease)

The file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped, which you'll get if you follow my tutorial, already contains this information, so, use that and filter out what you don't need.

Also, when reading your data from VCF/BCF into PLINK, it is critical that you specify a sample order file so that PLINK reads the samples in the order that you want and in the order that matches the samples as listed in your custom FAM. A sample command is:

plink --noweb \
--bcf My.bcf \
--keep-allele-order \
--indiv-sort file SampleSort.list \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \


The file mentioned in this command after the --indiv-sort file command-line parameter, SampleSort.list, contains 2 columns (FID IID), like this:

0  NA0165
0  NA0169
et cetera


Then, to do LD analysis in PLINK:

plink --file PlinkDataForLD \
--r2 --ld-window-kb 1000 \
--ld-window 100000 \
--ld-window-r2 0 \
--fam MyCustom.fam

0
Entering edit mode

Hi Kevin,

Thanks for your reply. Actually, I tried vcftools, but I got negative values for r2 which of course does not make sense! I am going try the long way approach, and I will let you know the updates.

Thanks, T

0
Entering edit mode

Hi Tarek,

Okay, on reflection, you may not require the complex part of creating the custom FAM, considering that all of your samples will be 'healthy' 1000 Genomes samples. The LD analysis will just look at all samples in the dataset and not use information on phenotype, gender, etc.

In that case, you possibly just need to do this:

plink --noweb --vcf My.vcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out PlinkDataForLD



PLINK is a very good and comprehensive analysis tool, though.

Respond here if you need help or want me to look at anything.

Kevin

0
Entering edit mode

Hi Kevin , I was able to do the analysis using a combination of bedtools, vcftools, and plink. I had 9000 SNPs distributed across 4 different genes which I did LD analysis for. Now I have LD analysis result (plink.ld) file. I want to view the analysis graphically, which tool do you recommend?

CHR_A         BP_A         SNP_A  CHR_B         BP_B         SNP_B           R2           DP
6    160543123     rs1867351      6    160543148    rs12208357    0.0215633            1
6    160543123     rs1867351      6    160543229    rs55918055   0.00124509            1
6    160543123     rs1867351      6    160543562      rs461473    0.0291014            1
6    160543123     rs1867351      6    160543610     rs4709400            1            1
6    160543123     rs1867351      6    160543934    rs62440864     0.815385            1
6    160543123     rs1867351      6    160544680   rs192748353   0.00124509            1
6    160543123     rs1867351      6    160544961    rs73025537    0.0201052            1
6    160543123     rs1867351      6    160545046   rs111909590   0.00635448            1
6    160543123     rs1867351      6    160545394      rs463599    0.0403996            1


Thnaks Tarek

0
Entering edit mode

Sorry, if I have bed/bim/fam files from 16SrRNAs and a phenotypes file like screenshot, does that mean that I should correlate these files with phenotypes files as OTU?

Yes, this an OTU, I should first convert that to PLINK format.

0
Entering edit mode

Hi Kevin,

Is your answer the same if I have WES vcf files processed by GATK from individuals from a large family - with both affected and unaffected individuals?

Would I simply need to combine these files together and then convert the vcf file into .ped format with plink?

What do you recommend doing after that? I'm having trouble figuring out how to place markers within the file to use for linkage downstream.

Any help would be greatly appreciated.

0
Entering edit mode

Hey, yes, you adopt the same general approach that I mention in my answer. I think that you'd need bcftools merge to merge the files. PLINK, then, has family-specific tests that you could use.

0
Entering edit mode

Hi Kevin, I have bim bed fam files from UK Biobank. How do I add phenotype. After I add phenotype, how can I get ped and map files for gplink.