Calculating statistically significant outliers for pairwise Fst obtained from VCFTools
7.6 years ago
Anurag ▴ 20

Hi,

I calculated pairwise Fst using VCFTools:

vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2


What method should I use to assess statistical significance and identify outlier regions or loci under putative selection or differentiation?

PValue Differentiation Outliers Selection • 9.2k views
7.6 years ago

Here are four suggestions:

1. See whether your Fst values fit a parametric distribution (or come reasonably close). Estimate the distribution's parameters and then look up a probability. Note that I did not say a p-value.
2. Permute your genotypes and re-run Fst many times. Comparing the observed Fst against the permuted values gives an empirical p-value, or probability.
3. Check out pFst. pFst is a likelihood ratio test for allele frequency differences. It gives you a true p-value based on a chi-squared lookup: https://github.com/jewmanchue/vcflib/wiki/Association-testing-with-GPAT
4. Check out Lositan. I have never used it, but it apparently provides significance values for Fst.
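
For suggestion 2, one common way to turn the permuted values into an empirical p-value is (number of permuted Fst values >= observed + 1) / (number of permutations + 1). A minimal sketch with awk, assuming the permuted Fst values have already been collected one per line in a file; the function name empirical_p and that file layout are made up for illustration and are not part of vcftools.

```shell
# empirical_p OBSERVED FILE
# OBSERVED: the Fst value from the real population assignment.
# FILE: one permuted Fst value per line (hypothetical layout).
empirical_p() {
    awk -v obs="$1" '
        # Skip empty lines; count values and those at least as large as the observed one.
        $1 != "" { n++; if ($1 + 0 >= obs + 0) ge++ }
        # The +1 in numerator and denominator keeps the estimate away from exactly zero.
        END { printf "%.4f\n", (ge + 1) / (n + 1) }
    ' "$2"
}
```

Usage would look like `empirical_p 0.05 permuted_fst.txt`, run once per window or locus, or once for the genome-wide estimate.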

To follow up on Zev's suggestion 2, and if you still want to use vcftools, you can perform a permutation by permuting the individuals defined in the population_1.txt and population_2.txt files.
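
A rough sketch of that label permutation, assuming each population file lists one sample ID per line (the format --weir-fst-pop reads). The helper name permute_pops and the perm1.txt, perm2.txt and perm_$i names are hypothetical. The helper pools the IDs from both files, shuffles them, and splits them back into two files with the original group sizes.

```shell
# permute_pops POP1 POP2 OUT1 OUT2
# Hypothetical helper, not part of vcftools: shuffles sample IDs across
# two population files while keeping the original group sizes.
permute_pops() {
    pop1=$1; pop2=$2; out1=$3; out2=$4
    n1=$(grep -c . "$pop1")                       # non-empty lines = samples in population 1
    grep -h . "$pop1" "$pop2" | shuf > "$out1.tmp" # pool both groups and shuffle
    head -n "$n1" "$out1.tmp" > "$out1"            # first n1 shuffled IDs become group 1
    tail -n +"$((n1 + 1))" "$out1.tmp" > "$out2"   # the rest become group 2
    rm -f "$out1.tmp"
}

# Example loop, commented out because it needs vcftools and the real input:
# for i in $(seq 1 1000); do
#     permute_pops population_1.txt population_2.txt perm1.txt perm2.txt
#     vcftools --vcf input_data.vcf --weir-fst-pop perm1.txt \
#              --weir-fst-pop perm2.txt --out perm_$i
# done
```

Each run then gives an Fst estimate under shuffled labels, and the observed value can be compared against that collection.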


Could you explain a bit more how that can be done?

7.6 years ago
Anurag ▴ 20

Dear Zev,

Thanks for the answer. I will try pFst. Lositan is not a practical solution for me, given that I have more than 2 million variant positions.

Regarding,

t,target     -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns


If I understood correctly, target means the individuals that we want to include in our analysis and background means those that we do not.

Could you explain how you use the PL or GL values for error correction and p-value calculation?


I have run pFst on 30 million variants. It took about 5 hours with one cpu.


Cool, I will try and let you know.


The target group is compared to the background group. The allele frequencies for the target and background groups are estimated from the genotype likelihoods, not the genotype counts.


Could you explain how this works in a population scenario, and what the effect would be with 12 samples per population? Let's say population A has samples 1.....12 and population B has samples 13....24. If I use 0...11 as target and 12....23 as background, will I get the same output as when I use 12....23 as target and 0...11 as background?

I can try this myself, but as you are the creator of the tool, you may already have tested it.

Best,

Anurag


Let's say you have 10 individuals, 5 target and 5 background: -t 0,1,2,3,4 -b 5,6,7,8,9

If you have large ranges you can use:

perl -e 'print join ",", (0..9)'
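
For the 12-versus-12 layout asked about above, the same kind of list can also be built with seq and paste. A small sketch: the variable names are arbitrary, and the pFst line in the comment is an assumption based on the usage example in the vcflib documentation (input_data.vcf is just the file name from the original question), so check it against pFst's own help output.

```shell
# Build zero-based, comma-separated index lists for two groups of 12 samples.
target=$(seq 0 11 | paste -sd, -)        # 0,1,...,11
background=$(seq 12 23 | paste -sd, -)   # 12,13,...,23
echo "-t $target -b $background"

# Assumed invocation, commented out because it needs vcflib and real data:
# pFst --target "$target" --background "$background" \
#      --file input_data.vcf --deltaaf 0.1 --type PL
```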


Would you be able to post an example of your input and command line? I am trying to run pFst but am getting the error: more sample fields than samples listed in header