Create linkage disequilibrium file
1
2
Entering edit mode
5.9 years ago
Dhana ▴ 110

Hi,

I am trying to calculate linkage disequilibrium (LD) values for a group of SNPs to be used as an input for the tool Locus explorer . I followed the tutorial they had mentioned in their help section to create LD file from 1000 Genomes data, When I try to replicate their example ( example - 2.1.2 Calculate LD for list of SNP against all SNPs within 1000kb region.) I get the following plink error;

Note: No phenotypes present.

Error: Duplicate ID '.'.

I checked for possible solutions to this and most of them suggested to use

--list-duplicate-vars

function of plink, in this case there are no duplicate ids and even using the function results in same error. I intend to create a file like this (InputFileFormat - LDfile);

CHR_A - Chromosome on which Index SNP is located (n.b. do not include "chr"). e.g. 2, 23

BP_A - Index SNP start coordinate (Hg19, do not include chromosome or end coordinate for in/del variants). e.g. 104356185

SNP_A - Index SNP ID

CHR_B - Chromosome for SNP in LD with Index SNP (SNP_A)

BP_B - Start coordinate (Hg19, do not include chromosome or end coordinate for in/del variants) of SNP in LD with Index SNP (SNP_A). e.g. 104315667

SNP_B - ID of SNP in LD with Index SNP (SNP_A). e.g. rs10786679, chr10:104329988:D

R2 - LD score between SNP_A and SNP_B (0 to 1). e.g. 1, 0.740917

Note: Lead SNP must be defined relative to itself for plotting purposes, e.g.:

CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2

2 173309618 rs13410475 2 173309618 rs13410475 1

2 173309618 rs13410475 2 172827293 rs148800555 0.0906124

Can anyone suggest a possible solution to this (with or without using PLINK)

Thanks.

linkage disequilibrium plink SNP LD LocusExplorer • 3.7k views
ADD COMMENT
1
Entering edit mode

Do you mind sharing your SNP file?

Try using --biallelic-only strict flag to drop SNPs that have same positions?

Added issue at GitHub, we will update the tutotrial file on next release 0.7.1

ADD REPLY
0
Entering edit mode

Hi, thank you for your reply. I tried the --biallelic-only strict flag as well, still the same error message.

The data is not my own so I am not at liberty to share it publicly. I have given a sample of the SNP file below;

snplist.txt

rs148170422

rs547289895

rs543765287

rs9747082

rs574351215

rs8071558

rs6501437

ADD REPLY
1
Entering edit mode
5.9 years ago
zx8754 11k

Here is a guide on how to get LD files when we have duplicated ID issue in the 1000 genomes vcf files. I am using your example snplist.txt file

We need to get 1000 genomes vcf for the region, usually I give 250Kb flank based on my SNPs in snplist.txt.

tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1-71324000 > genotypes.vcf

This as you described give us an error.

plink --vcf  genotypes.vcf \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out myLD

# ...
# Error: Duplicate ID '.'.

As some SNP names are actually dots ".", they don't have an RS ID attached to them. So we need to give them unique names.

Convert to plink format

plink --vcf  genotypes.vcf \
--make-bed \
--out genotypes

Here is the cause of the problem, dots as SNP IDs:

head -n5 genotypes.bim
# 17      .       0       56      T       C
# 17      .       0       78      C       G
# 17      .       0       355     A       G
# 17      .       0       684     T       C
# 17      rs62053745      0       828     T       C

Run this R script to make unique names:

Rscript makeSNPnamesUnique.R genotypes.bim 
head -n5 genotypes.bim 
# 17      17_56_T_C       0       56      T       C
# 17      17_78_C_G       0       78      C       G
# 17      17_355_A_G      0       355     A       G
# 17      17_684_T_C      0       684     T       C
# 17      rs62053745      0       828     T       C

Now the SNP IDs are fixed, we run plink LD as usual:

plink --bfile genotypes \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \                   # to have smaller output file: --ld-window-r2 0.2
--out myLD

Note: If the output file is too big, then consider setting higher R2 filter value, for example --ld-window-r2 0.2, meaning only output R2 values that are more than 0.2.

Check the output

wc -l myLD.ld
# 49172 myLD.ld

head -n5 myLD.ld
#  CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
#     17          834         rs9747082     17           56         17_56_T_C   0.00137027
#     17          834         rs9747082     17          355        17_355_A_G   0.00151223
#     17          834         rs9747082     17          684        17_684_T_C   0.00127126
#     17          834         rs9747082     17          828        rs62053745     0.678207

Finally, we can use this file as input for LocusExplorer.

If you have any further question please post an issue at GitHub LocusExplorer.

ADD COMMENT
0
Entering edit mode

Thank you Tokhir. The solution you suggested worked for generating the file but on the other hand the generated file is so huge that it cannot be uploaded in locus explorer. For my original file, containing about 34 SNPs the generated LD file was around 50 Mb (559621 LD values), to cross verify it I tried to create a similar one for the example data provided with Locus explorer;

https://github.com/oncogenetics/LocusExplorer/blob/master/Data/CustomDataExample/Association.txt

This file resulted in a LD file of about 3.5 Gb in size, the original LD file provided as example with the tool was around 100 kb ins size and contained about 1492 LD values whereas the newly generated one contains 39534086 LD values.

Do you any suggestions for this scenario?

ADD REPLY
1
Entering edit mode

I would suggest to change this line --ld-window-r2 0 to --ld-window-r2 0.2, i.e.: keep only R2 values that have more than 0.2, that should make the output files smaller.

Also, read about other plink LD filters: ld-window, --ld-window-kb

ADD REPLY

Login before adding your answer.

Traffic: 2239 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6