Question: VCFtools LD for haploids is nan
1
gravatar for biobio
3.3 years ago by
biobio30
United States
biobio30 wrote:

 

 

Hi, I'm trying to calculate LD for a population. According to this post: http://sourceforge.net/p/vcftools/mailman/message/33219601/

the author says that it should be fine for haploid individuals if their genotype is 1 or "1|.": 

Apologies. The announcement was slightly misinformative. The change allowed
one to calculate r^2 with genotypes of the form '1|.', but not actually for
haploids like '1'. I've just committed a change to allow the second case,
so please give it a go and drop me a line if there are any issues.

So I tried calculating LD on my vcf file like so:

    vcftools --vcf HA.diploid.vcf --hap-r2 --out test

But I got only nan in my output for the value of r^2. I then converted my VCF to have 1|. for the genotype field instead of just 1 (looks something like this):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf   M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf    M-ELH20.18511_ATCCTGTA_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf   M-ELH27.18516_CCTCTATC_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf   M-ARB0.18513_GCTAACGA_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf    M-Set0.22009_CATACCAA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf
1       306     .       G       T       .       PASS    AC=1;AN=1;SF=9  GT      .|.     .|.     .|.     1|.     .|.     .|.
1       375     .       A       G       .       PASS    AC=5;AN=5;SF=1,2,3,9,15 GT      1|.     1|.     .|.     1|.     1|.     1|.
1       424     .       C       A,T     .       PASS    AC=3,0;AN=3;SF=2,6,9,14,15,16,19        GT      .|.     1|.     .|.     1|.     .|.     1|.
1       465     .       G       A       .       PASS    AC=1;AN=1;SF=5,15       GT      .|.     .|.     .|.     .|.     .|.     1|.
1       508     .       T       C       .       PASS    AC=5;AN=5;SF=1,2,3,5,6,9,10,12,14,15,16,17,19,20        GT      1|.     1|.     .|.     1|.     1|.     1|.​

But my output is still nan:

CHR     POS1    POS2    N_CHR   R^2     D       Dprime
1       306     375     1       nan     0       nan
1       306     465     0       nan     nan     nan
1       306     508     1       nan     0       nan
1       306     528     1       nan     0       nan
1       306     711     1       nan     0       nan
1       306     723     1       nan     0       nan
1       306     727     0       nan     nan     nan
1       306     856     0       nan     nan     nan
1       306     864     1       nan     0       nan​

Any ideas on what I'm doing wrong? Thank you very much!

 

 

ld vcftools • 1.8k views
ADD COMMENTlink modified 3.3 years ago by Zev.Kronenberg11k • written 3.3 years ago by biobio30
1
gravatar for Zev.Kronenberg
3.3 years ago by
United States
Zev.Kronenberg11k wrote:

Probably a parsing issue.  Just backfill the "."s with a zero.  It should yield the same results. 

 

ADD COMMENTlink written 3.3 years ago by Zev.Kronenberg11k

Awesome, changing from .|. to 0|0 and 1|. to 1|0 works. Does this have any implications for the LD calculations though? 

ADD REPLYlink written 3.3 years ago by biobio30

I wouldn't do ".|." -> "0|0", that might cause errors.  The LD calculation is based on frequencies.  You are doubling the amount of haploids you are counting, but not the frequencies of the haplotypes.  You should be fine.

 

 

ADD REPLYlink written 3.3 years ago by Zev.Kronenberg11k

Hmm, doing that I get all 1s for my r^2 value:

CHR     POS1    POS2    N_CHR   R^2     D       Dprime
1       306     375     2       1       0.25    1
1       306     465     0       nan     nan     nan
1       306     508     2       1       0.25    1
1       306     528     2       1       0.25    1
1       306     711     2       1       0.25    1
1       306     723     2       1       0.25    1
1       306     727     0       nan     nan     nan
1       306     856     0       nan     nan     nan
1       306     864     2       1       0.25    1
1       306     901     0       nan     nan     nan
1       306     1019    2       1       0.25    1
1       306     1414    2       1       0.25    1
1       306     2248    2       1       0.25    1
1       306     2666    2       1       0.25    1
1       306     4987    2       1       0.25    1
1       306     5972    2       1       0.25    1
1       306     6063    2       1       0.25    1
1       306     6514    2       1       0.25    1
1       306     6557    2       1       0.25    1​
ADD REPLYlink written 3.3 years ago by biobio30

Do you only have 2 chromosome?

ADD REPLYlink written 3.3 years ago by Zev.Kronenberg11k

No there's 5 (Arabidopsis). I think the issue may be that only polymorphisms are reported and the VCF doesn't distinguish between missing data and matches to the reference. I will try writing a script that replaces "." that correspond to matching to the reference and "." that correspond to missing data.

ADD REPLYlink written 3.3 years ago by biobio30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 639 users visited in the last hour