Question: Can I run population genetics stats on my phased datasets?
14 months ago by nataliagru1 wrote:


I would like to calculate Tajima's D for a SNP dataset of three parasite populations: Trypanosoma brucei brucei, Trypanosoma brucei gambiense and Trypanosoma brucei rhodesiense.

Initially I had 95 strains as .fastq files, which were mapped to the reference genome using GATK. The mapped outputs were .sam and .sorted.bam files. I called SNPs from the .sorted.bam files using samtools and vcftools, and the called SNPs were output as VCF files.

I merged the VCFs for each subspecies using GATK (all Trypanosoma brucei brucei strains together, all Trypanosoma brucei gambiense strains together, etc.).

At the end I have 3 merged VCF files: one each for T. brucei brucei, T. brucei gambiense and T. brucei rhodesiense.

I have phased the 3 VCF files using Beagle 4.1, which output them as compressed VCF files (.vcf.gz), and I would now like to use PopGenome (an R package) to calculate population genetic statistics.

My VCF files are phased and contain full karyotype information for multiple strains (for example, the merged T. brucei brucei VCF file has 25 samples; these samples were separate VCFs until I merged them into one).

I would like to run population genetic analyses (Tajima's D, theta, pi) on each of my merged VCF files. However, I am wondering whether this is erroneous. Will the resulting values be informative if I am using a dataset that contains multiple samples (a merged VCF) and covers the whole genome?

Apologies if there is confusion with my question or methods. Any insight would be greatly appreciated!! Thank you!!
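
To make the three statistics named above concrete, here is a minimal Python sketch of how pi (nucleotide diversity), Watterson's theta and Tajima's D are defined for a set of phased haplotypes. It is not PopGenome's implementation, and the function name `diversity_stats` and the toy data are hypothetical. It assumes biallelic sites coded 0/1 with no missing calls, and one row per haplotype, so a diploid sample contributes two rows after phasing. The values are totals over all sites passed in, not per-site or per-window estimates.

```python
from math import sqrt


def diversity_stats(haplotypes):
    """Compute S, pi, Watterson's theta and Tajima's D.

    haplotypes: list of equal-length lists of 0/1 alleles, one list per
    phased haplotype (assumption: biallelic sites, no missing data).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    n_sites = len(haplotypes[0])

    # Alternate-allele count at each site, keeping only segregating sites.
    alt_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    seg = [k for k in alt_counts if 0 < k < n]
    S = len(seg)

    # pi: mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in seg)

    # Watterson's theta: S scaled by the harmonic number a1.
    a1 = sum(1.0 / i for i in range(1, n))
    theta_w = S / a1

    # Constants from Tajima (1989) for the variance of (pi - theta_w).
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * S + e2 * S * (S - 1)

    # D is undefined without segregating sites or with zero variance.
    tajima_d = (pi - theta_w) / sqrt(variance) if variance > 0 else None

    return {"S": S, "pi": pi, "theta_w": theta_w, "tajima_d": tajima_d}


# Toy data: 4 haplotypes x 4 sites (the last site is monomorphic).
example = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
stats = diversity_stats(example)
print(stats)
```

Because every statistic depends on the number of haplotypes n and on the allele counts across them, a merged multi-sample VCF is the natural input for this kind of calculation. Whether genome-wide totals or sliding windows are more informative is a separate choice that this sketch does not address.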
