Question: Calculating statistically significant outlier for Pairwise Fst obtained from VCFTools
0
4.8 years ago by
Anurag20
Belgium
Anurag20 wrote:

Hi,

I calculated pairwise Fst using VCFTools:

vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2

What method should I use to assess statistical significance and identify outlier regions or loci under putative selection or differentiation?

1
4.8 years ago by
United States
Zev.Kronenberg11k wrote:

Here are four suggestions:

1.  See if your Fst values fit a parametric distribution (or something close to one).  Estimate the distribution's parameters and then look up a probability.  Notice I did not say a p-value.

2.  Permute your genotypes and re-run Fst many times.  Comparing the observed Fst to the permuted values gives what would be considered an empirical p-value, or probability.

3.  Check out pFst.  pFst is a likelihood ratio test for allele frequency differences.  It gives you a true p-value based on a Chi-Sq lookup: https://github.com/jewmanchue/vcflib/wiki/Association-testing-with-GPAT

4. Check out Lositan.  I have never used it, but it apparently provides significance values for Fst: http://popgen.net/soft/lositan/
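For suggestion 2, a minimal sketch of how an empirical p-value could be computed once the permuted Fst values for a locus are available. The function name and the example values are hypothetical, and the +1 correction is one common convention rather than something prescribed in this thread.

```python
def empirical_p_value(observed, permuted):
    """One-sided empirical p-value: share of permuted Fst values >= observed.

    The +1 in the numerator and denominator counts the observed value
    itself, so the result is never exactly zero.
    """
    if not permuted:
        raise ValueError("need at least one permuted value")
    exceed = sum(1 for value in permuted if value >= observed)
    return (exceed + 1) / (len(permuted) + 1)


# Hypothetical Fst values for one locus: observed value vs. 9 permutations.
permuted_fst = [0.01, 0.03, 0.02, 0.00, 0.05, 0.04, 0.02, 0.01, 0.03]
p = empirical_p_value(0.045, permuted_fst)
```

With only a handful of permutations the smallest reachable value is 1 / (N + 1), so the number of permutations limits how small the empirical probability can get.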

To follow up on Zev's suggestion 2, and if you still want to use vcftools, you can permute the individuals defined in the population_1.txt and population_2.txt files (that is, shuffle which individuals are assigned to each file) and re-run vcftools on each permutation.

Could you explain a bit more how that can be done?
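A rough sketch of that permutation, assuming each population file lists one individual ID per line. The sample IDs, the perm_0 file names and both helper functions are hypothetical. Only the vcftools options already shown in the question are used.

```python
import random


def permute_populations(pop1, pop2, rng):
    """Shuffle the pooled sample IDs and split them into two groups of the original sizes."""
    pooled = list(pop1) + list(pop2)
    rng.shuffle(pooled)
    return pooled[:len(pop1)], pooled[len(pop1):]


def vcftools_command(vcf, pop1_file, pop2_file, out_prefix):
    """Build the vcftools call from the question for one permuted pair of files."""
    return ["vcftools", "--vcf", vcf,
            "--weir-fst-pop", pop1_file,
            "--weir-fst-pop", pop2_file,
            "--out", out_prefix]


# Hypothetical sample IDs; in practice read them from population_1.txt and population_2.txt.
pop1 = ["A1", "A2", "A3"]
pop2 = ["B1", "B2", "B3", "B4"]
rng = random.Random(42)  # fixed seed so the shuffles are reproducible
perm1, perm2 = permute_populations(pop1, pop2, rng)
cmd = vcftools_command("input_data.vcf", "perm_0_pop1.txt", "perm_0_pop2.txt", "perm_0")
```

For each permutation you would write perm1 and perm2 to their own files (one ID per line), run cmd (for example with subprocess.run), and collect the resulting Fst values to compare against the observed ones.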

0
4.8 years ago by
Anurag20
Belgium
Anurag20 wrote:

Dear Zev,

Thanks for the answer. I will try pFst. Lositan is not a practical solution for me, given that I have more than 2 million variant positions.

Regarding,

```
t,target     -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns
```

If I understood correctly, target means the individuals that we want to include in our analysis and background means those we do not.

May I know how you model this using the PL or GL values for error correction and p-value calculation?

I have run pFst on 30 million variants.  It took about 5 hours with one CPU.

Cool, I will try and let you know.

The target group is compared to the background group.  The allele frequencies of the target and background groups are estimated from the genotype likelihoods, not the genotype counts.

Could you explain how to do this in a population scenario, and what the effect would be with 12 samples per population? Let's say population A has samples 1...12 and population B has samples 13...24. If I use 0...11 as target and 12...23 as background, will I get the same output as when I use 12...23 as target and 0...11 as background?

I can try this myself, but as you are the creator of the tool, you may already have tested it.

Best,

Anurag

Let's say you have 10 individuals 5 target and 5 background: -t 0,1,2,3,4 -b 5,6,7,8,9

If you have large ranges you can use:

perl -e 'print join ",", (0..9)'
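The same lists can be built in Python. The sketch below is only illustrative: both helper names and the example header are hypothetical. It assumes that index 0 refers to the first sample column of the VCF, i.e. the tenth tab-separated field of the #CHROM header line.

```python
def index_list(start, stop):
    """Comma-separated zero-based indices from start to stop, inclusive."""
    return ",".join(str(i) for i in range(start, stop + 1))


def sample_indices(header_line, wanted):
    """Zero-based positions of the wanted sample names in a VCF #CHROM header line."""
    # The first 9 columns of the header are fixed VCF fields; samples follow.
    samples = header_line.rstrip("\n").split("\t")[9:]
    position = {name: i for i, name in enumerate(samples)}
    return ",".join(str(position[name]) for name in wanted)


target = index_list(0, 11)        # population A, samples 1..12
background = index_list(12, 23)   # population B, samples 13..24

# Hypothetical header with four samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4"
picked = sample_indices(header, ["S2", "S4"])
```

The second helper may be handy when the individuals of a population are not contiguous in the VCF, since the names from population_1.txt can then be mapped to column indices instead of being counted by hand.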