Different estimates of nucleotide diversity (pi) from two pipelines: pixy vs vcftools
3.2 years ago
nitinra ▴ 50

Hello all,

I am trying to calculate nucleotide diversity for 192 samples and have used both vcftools and pixy to do it. However, the two tools give very different results. Is there a way to evaluate which one gives the accurate estimate of nucleotide diversity?

Here is the pipeline I used:

vcftools --vcf input.vcf --max-missing 0.1 --minQ 30 --maf 0.1 --remove lowdepthindividuals --recode --recode-INFO-all --out output_filtered.vcf
bcftools +prune -l 0.2 -w 50kb output_filtered.vcf -Ov -o output_filtered_ldpruned.vcf

Pi calculations:

VCFtools:

vcftools --vcf output_filtered_ldpruned.vcf --window-pi 10000 --out pi


pixy:

pixy --stats pi --vcf output_filtered_ldpruned.vcf --zarr_path ./zarr \
    --window_size 10000 --populations allpop.list --bypass_filtration yes \
    --bypass-invariant-sites yes --outfile_prefix results/combined

The pi estimates from VCFtools range from 0 to 0.020, whereas the ones from pixy range from 0.1 to 0.4. What could be causing the discrepancy between the two methods?

vcftools nucleotide diversity pixy • 2.0k views
18 months ago
Sumaya • 0

From what I have read, it seems that vcftools treats missing data as invariant genotyped bases (i.e. homozygous for the reference allele), and this biases the estimate. That is not the case in pixy, as it excludes any missing data from the calculation. Have a look at the pixy paper: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13326
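
To make the arithmetic concrete, here is a minimal Python sketch of the two ways of handling missing genotypes. It is a toy model under my own assumptions, not the actual code of vcftools or pixy: the function names and the small `sites` matrix are made up for illustration, genotypes are coded per haplotype as 0 (reference), 1 (alternate), or None (missing), and the "missing as invariant" version simply counts every pair of haplotypes as a comparison even when one of them has no call.

```python
from itertools import combinations


def count_differences(calls):
    """Count pairwise allele differences among the called haplotypes at one site."""
    return sum(1 for a, b in combinations(calls, 2) if a != b)


def n_pairs(n):
    """Number of unordered pairs among n haplotypes."""
    return n * (n - 1) // 2


def pi_exclude_missing(sites):
    """Pi as total differences / total comparisons, using only called genotypes."""
    diffs = comps = 0
    for site in sites:
        called = [g for g in site if g is not None]
        diffs += count_differences(called)
        comps += n_pairs(len(called))
    return diffs / comps if comps else float("nan")


def pi_missing_as_invariant(sites):
    """Toy model of the bias: differences come from called genotypes only,
    but the denominator assumes every haplotype was genotyped at every site."""
    diffs = comps = 0
    for site in sites:
        called = [g for g in site if g is not None]
        diffs += count_differences(called)
        comps += n_pairs(len(site))  # missing calls still counted as comparisons
    return diffs / comps if comps else float("nan")


# Hypothetical window: 4 haplotypes (columns) at 4 sites (rows).
sites = [
    [0, 1, None, None],  # variant site, two missing calls
    [0, 0, 0, 0],        # invariant site, fully genotyped
    [1, 1, None, 0],     # variant site, one missing call
    [0, 0, 0, 0],        # invariant site, fully genotyped
]

print(pi_exclude_missing(sites))       # 3 differences / 16 comparisons
print(pi_missing_as_invariant(sites))  # 3 differences / 24 comparisons
```

In this toy window the numerator is the same in both cases and only the denominator changes, so the second estimate is lower. The more missing genotypes you keep (for example with a permissive --max-missing setting), the larger the gap between the two numbers would be in this model.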

