I would like to calculate Tajima's D for a SNP dataset of three parasite populations: Trypanosoma brucei brucei, Trypanosoma brucei gambiense and Trypanosoma brucei rhodesiense.
Initially I had 95 strains (.fastq files), which were mapped to the reference genome using GATK. The mapped outputs were .sam and .sorted.bam files. I called SNPs from the .sorted.bam files using samtools and vcftools, and the SNP calls were output as VCFs.
I merged the VCFs for each subspecies using GATK (all Trypanosoma brucei brucei strains together, all Trypanosoma brucei gambiense strains together, etc.).
I now have 3 merged VCF files, one each for T. brucei brucei, T. brucei gambiense and T. brucei rhodesiense.
I have phased the 3 VCF files using Beagle 4.1, which output them as compressed VCFs (.vcf.gz), and I would now like to use PopGenome (R package) to run population genetic statistics.
My VCF files are phased and contain full karyotype information for multiple strains (for example, the merged T. brucei brucei VCF file has 25 samples; these were separate VCFs until I merged them into one).
I would like to run population genetic analyses (Tajima's D, theta, pi) on each of my merged VCF files. However, I am wondering if this is erroneous. Will the resulting values be informative if I am using a dataset that contains multiple samples (a merged VCF) and covers the whole genome?
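To make clear which statistics I mean, here is a minimal Python sketch of how I understand pi, Watterson's theta and Tajima's D to be computed from phased haplotypes. It is only an illustration under assumptions: biallelic sites coded 0/1, no missing data, and each diploid sample contributing two haplotypes (so 25 samples would give n = 50). The function name `diversity_stats` and the toy data are made up for this post, and this is not PopGenome's implementation.

```python
from math import sqrt


def diversity_stats(haplotypes):
    """Return (pi, theta_w, tajimas_d) for a list of 0/1 haplotypes.

    Assumes biallelic sites, no missing data, and equal-length haplotypes.
    pi and theta_w are totals over the region, not per-site values.
    """
    n = len(haplotypes)
    if n < 3:
        raise ValueError("need at least 3 haplotypes")
    n_sites = len(haplotypes[0])

    pi = 0.0       # mean number of pairwise differences
    seg_sites = 0  # number of segregating sites (S)
    for j in range(n_sites):
        k = sum(h[j] for h in haplotypes)  # count of allele 1 at site j
        if 0 < k < n:
            seg_sites += 1
            pi += 2.0 * k * (n - k) / (n * (n - 1))

    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    theta_w = seg_sites / a1  # Watterson's estimator
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return pi, theta_w, d


if __name__ == "__main__":
    # Toy data: 4 haplotypes (e.g. 2 phased diploid samples), 3 sites
    toy = [
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 0],
    ]
    print(diversity_stats(toy))
```

In practice I would expect to compute these per chromosome or in sliding windows rather than as a single genome-wide number, which is part of what I am unsure about.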
Apologies if there is confusion with my question or methods. Any insight would be greatly appreciated!! Thank you!!