Hi,
I'm new to working with SNP data and I'm quite confused about how best to analyse what I have. I work with a haploid species.
My SNP files are in text format:
CHR SNP_POS SAMPLE1 SAMPLE2 etc...
chr1 5 A -
chr1 12 T G
etc... for about 400,000 SNPs and 20 samples. I use this format because I run custom scripts that do extra quality control and calculate the likely base at each position based on read depth; I have no option of doing this another way. SNPs are filtered at <10% missingness in the dataset.
I want to work out genome-wide nucleotide diversity based on these SNPs. My questions are:
1) For nucleotide diversity (pi): do I need to reconstruct whole-genome haplotypes for each sample by substituting each appropriate base of the reference with the alternative 'SNP base' for that sample?
2) If so, any suggestions on how to do this? I've found tools that work with VCF files but not with the text files I have.
3) Otherwise, can I calculate pi based only on SNP data? This doesn't seem like a valid method to me.
4) I can't seem to find a programme for calculating pi/theta that will work with text files. I can happily reformat them into another text-based layout, but I can't convert them to VCF.
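To make question 3 concrete, here is a minimal Python sketch of what calculating pi from the SNP table alone might look like. It rests on several assumptions: samples are haploid, "-" marks a missing call, the header line starts with "CHR" as in the example above, and sites absent from the table are invariant. The names `site_pi`, `genome_pi` and `callable_sites` are made up for illustration; `callable_sites` would have to be the number of positions (variant or not) that passed the same filters, which the table itself does not contain.

```python
from collections import Counter

MISSING = "-"  # assumed missing-data symbol, as in the example rows


def site_pi(alleles):
    """Mean pairwise difference at one site among haploid samples with a call."""
    called = [a for a in alleles if a != MISSING]
    n = len(called)
    if n < 2:
        return 0.0  # no pair of samples to compare at this site
    counts = Counter(called)
    total_pairs = n * (n - 1) // 2
    # Pairs of samples that carry the same allele contribute no difference.
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return (total_pairs - same_pairs) / total_pairs


def genome_pi(lines, callable_sites):
    """Sum per-site pi over all SNP rows, then divide by the callable length."""
    total = 0.0
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "CHR":  # skip blank lines and the header
            continue
        total += site_pi(fields[2:])  # columns after CHR and SNP_POS are samples
    return total / callable_sites
```

With a file handle opened on the SNP table, this would be called as `genome_pi(fh, callable_sites=N)`, where N is a value I would have to supply separately. Dividing by the number of SNPs instead of N would give diversity per variable site, which is not the same quantity as genome-wide pi.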
Any clarification or advice would be very welcome! Thanks.