Do I need to reconstruct haplotypes from SNP data to calculate nucleotide diversity?
3.3 years ago


I'm new to working with SNP data and I'm quite confused about how best to analyse what I have. I work with a haploid species.

My SNP files are in text format:


chr1 5 A -

chr1 12 T G

etc., for about 400,000 SNPs and 20 samples. I use this format because I rely on custom scripts that do extra quality control and calculate the likely base at each position based on read depth; I have no option of doing this another way. SNPs are filtered at <10% missingness in the dataset.

I want to work out genome-wide nucleotide diversity based on these SNPs. My questions are:

1) For nucleotide diversity (pi): do I need to reconstruct whole-genome haplotypes for each sample by substituting the alternative 'SNP base' for the reference base at each SNP position in that sample?

2) If so - any suggestions on how to do this? I've found tools that work with VCF files but not the text files I have

3) Otherwise, can I calculate pi based only on SNP data? This doesn't seem like a valid method to me.

4) I can't seem to find a program that calculates pi/theta from text files. I can happily reformat my files into another text layout, but I can't convert them to VCF.
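To make question 3 concrete, here is a minimal Python sketch of what a SNP-only calculation might look like. It rests on assumptions that are not confirmed above: that each line holds a chromosome, a position, and then one base call per sample, and that `-` marks a missing call. The function names and the `callable_sites` parameter are hypothetical. The sketch computes per-site pairwise differences for haploid samples, sums them over SNPs, and divides by the number of sites that could have been called (variant or not), rather than by the number of SNPs.

```python
from collections import Counter

MISSING = "-"  # assumption: '-' denotes a missing call for a sample


def site_diversity(calls):
    """Proportion of sample pairs that differ at one site (haploid samples)."""
    alleles = [c for c in calls if c != MISSING]
    n = len(alleles)
    if n < 2:
        return 0.0  # no pair can be compared at this site
    counts = Counter(alleles)
    # Ordered pairs carrying the same allele, out of n * (n - 1) ordered pairs
    same_pairs = sum(k * (k - 1) for k in counts.values())
    return 1.0 - same_pairs / (n * (n - 1))


def nucleotide_diversity(lines, callable_sites):
    """Sum per-site diversity over SNP lines, then scale by callable sites.

    callable_sites is a hypothetical parameter: the count of positions
    (invariant ones included) that passed the same filters as the SNPs.
    """
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        total += site_diversity(fields[2:])  # assumed layout: chrom, pos, calls...
    return total / callable_sites


if __name__ == "__main__":
    snps = ["chr1 5 A -", "chr1 12 T G"]
    print(nucleotide_diversity(snps, callable_sites=100))
```

Under these assumptions, invariant sites contribute zero to the sum, so whole-genome haplotypes would not need to be written out; only the count of callable sites is needed for the denominator.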

Any clarifications or advice would be very much welcomed! Thanks

haplotypes SNPs nucleotide diversity • 1.0k views
