In Python, I am using EggLib. I am trying to calculate Jost's D value per SNP found in a VCF file.
Data
Data is here in VCF format. The data set is small, there are 2 populations, 100 individuals per population and 6 SNPs (all on chromosome 1).
Each individual is named Pp.Ii
, where p
is the population index it belongs to and i
is the individual index.
Issue
I fail to understand the format of the Structure
object explained here on the EggLib website. I suppose my misunderstanding is due to the fact that the example refers to a FASTA file while I am using a VCF file and I fail to extend the logic of the Structure
object to my case.
Code
### Read the vcf file ###
vcf = egglib.io.VcfParser("MyData.vcf")
### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}
# Loop through each population
for popIndex in [0,1]:
# dictionnary for a given population. There are two populations
dpop = {}
# Loop through each individual
for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):
# A single list to define an individual
dpop[IndIndex] = [IndIndex*2, IndIndex*2 + 1]
dcluster[popIndex] = dpop
Structure = {0: dcluster}
### Define the population structure ###
egglib.stats.make_structure(Structure, None)
### Configurate the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj') # Jost's D
### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)
### Calculate Jost's D ###
cs.process_site(site, struct=Structure)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
self._frq.process_site(site, struct=struct)
File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)