Calculating statistically significant outliers for pairwise Fst obtained from VCFtools
7.0 years ago
Anurag ▴ 20

Hi,

I calculated pairwise Fst using VCFTools:

vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2

Which method should I use to assess statistical significance and identify outlier regions or loci under putative selection or differentiation?

Thanks in advance for help.

PValue Differentiation Outliers Selection • 8.6k views
7.0 years ago

Here are four suggestions:

  1. See if your Fst values fit a parametric distribution (or something close to one). Estimate the distribution's parameters and then look up a probability. Notice I did not say a p-value.
  2. Permute your genotypes and re-run Fst many times. This would be considered an empirical p-value, or probability.
  3. Check out pFst. pFst is a likelihood ratio test for allele frequency differences. It gives you a true p-value based on a Chi-Sq lookup: https://github.com/jewmanchue/vcflib/wiki/Association-testing-with-GPAT
  4. Check out Lositan. I have never used it, but it apparently provides significance values for Fst.
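
As a rough illustration of suggestion 2, an empirical p-value can be computed by comparing the observed Fst with the values obtained from permuted data. The sketch below assumes you have already collected the permuted Fst values for one site or window; the function name and the numbers are hypothetical and only show the arithmetic.

```python
def empirical_p_value(observed, permuted):
    """Share of permuted Fst values at least as large as the observed one.

    Adding 1 to both numerator and denominator treats the observed value
    as one more permutation, so the estimate is never exactly zero.
    """
    if not permuted:
        raise ValueError("need at least one permuted value")
    n_extreme = sum(1 for value in permuted if value >= observed)
    return (n_extreme + 1) / (len(permuted) + 1)


# Hypothetical numbers for illustration only, not real data.
observed_fst = 0.30
permuted_fst = [0.02, 0.05, 0.01, 0.31, 0.04, 0.03, 0.00, 0.06, 0.02]
print(empirical_p_value(observed_fst, permuted_fst))
```

The smallest p-value this can return is 1 / (number of permutations + 1), so the number of permutations limits how small a probability you can resolve.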

To follow up on Zev's suggestion 2, and if you still want to use vcftools, you can perform a permutation by permuting the individuals defined in the population_1.txt and population_2.txt files.
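
One way this permutation could be set up is sketched below. It pools the sample IDs from the two population files, shuffles them, and splits them back into two groups of the original sizes; each permuted pair of files can then be passed to the same vcftools command as in the question. The helper names, sample IDs, and output file names are hypothetical, and the sketch only builds the command rather than running it.

```python
import random


def permute_populations(pop1_ids, pop2_ids, seed=None):
    """Shuffle the pooled sample IDs and split them back into two groups
    with the same sizes as the original populations."""
    pooled = list(pop1_ids) + list(pop2_ids)
    rng = random.Random(seed)
    rng.shuffle(pooled)
    return pooled[:len(pop1_ids)], pooled[len(pop1_ids):]


def write_ids(path, ids):
    """Write one sample ID per line, as in the population files
    given to --weir-fst-pop."""
    with open(path, "w") as handle:
        handle.write("\n".join(ids) + "\n")


def vcftools_command(vcf, pop1_file, pop2_file, out_prefix):
    """Build (but do not run) the vcftools call from the original question."""
    return ["vcftools", "--vcf", vcf,
            "--weir-fst-pop", pop1_file,
            "--weir-fst-pop", pop2_file,
            "--out", out_prefix]


# Hypothetical sample IDs; in practice read them from
# population_1.txt and population_2.txt.
pop1 = ["s1", "s2", "s3"]
pop2 = ["s4", "s5", "s6"]
perm1, perm2 = permute_populations(pop1, pop2, seed=1)
print(perm1, perm2)
print(" ".join(vcftools_command("input_data.vcf", "perm_1.txt",
                                "perm_2.txt", "perm_run_1")))
```

To actually run a permutation, write the two lists with `write_ids` and pass the command list to `subprocess.run`; repeating this with different seeds gives the null distribution of Fst values.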


Could you explain a bit more about how that can be done?

7.0 years ago
Anurag ▴ 20

Dear Zev,

Thanks for the answer. I will try pFst. Lositan is not a practical solution for me, given that I have more than 2 million variant positions.

Regarding,

t,target     -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns

If I understood correctly, target means the individuals that we want to include in our analysis and background means those we do not.

Could you explain how the PL or GL values are used in the model for error correction and p-value calculation?


I have run pFst on 30 million variants. It took about 5 hours with one CPU.


Cool, I will try and let you know.


The target group is compared to the background group. The allele frequencies of the target and background are estimated from the genotype likelihoods, not the genotype counts.
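
To make the difference between likelihood-based and count-based estimates concrete, here is a minimal sketch of estimating an allele frequency from phred-scaled PL values at one biallelic site. This is not pFst's code: the function name is hypothetical, and the flat prior over the three diploid genotypes is an assumption made purely for illustration.

```python
def alt_allele_frequency(pl_per_sample):
    """Estimate the alternate-allele frequency at one biallelic site.

    pl_per_sample: list of (PL_homref, PL_het, PL_homalt) tuples, one per
    diploid sample, phred-scaled as in the VCF PL field.
    """
    expected_alt = 0.0
    for pls in pl_per_sample:
        # Convert phred-scaled values back to relative likelihoods.
        likelihoods = [10 ** (-pl / 10) for pl in pls]
        total = sum(likelihoods)
        # Normalise (assumed flat prior over genotypes) and take the
        # expected number of alternate alleles: 0, 1 or 2 copies.
        expected_alt += sum(copies * lik / total
                            for copies, lik in enumerate(likelihoods))
    return expected_alt / (2 * len(pl_per_sample))


# Hypothetical PL values: one confident hom-ref, one confident het,
# and one sample with no information at all.
print(alt_allele_frequency([(0, 60, 120), (60, 0, 60), (0, 0, 0)]))
```

A sample with uninformative likelihoods contributes an intermediate expected dosage instead of a hard genotype call, which is the practical difference from counting called genotypes.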


Could you explain how to do this in a population scenario, and what the effect would be with 12 samples per population? Let's say population A has samples 1 to 12 and population B has samples 13 to 24. If I use 0...11 as target and 12...23 as background, will I get the same output as when I use 12...23 as target and 0...11 as background?

I can try this myself, but as you are the creator of the tool, you may already have tested it.

Best,

Anurag


Let's say you have 10 individuals, 5 target and 5 background: -t 0,1,2,3,4 -b 5,6,7,8,9

If you have large ranges you can use:

perl -e 'print join ",", (0..9)'
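
The same index lists can be generated in Python. The sketch below uses the 12-versus-12 layout from the question above (0 to 11 as target, 12 to 23 as background); the helper name is hypothetical.

```python
def index_list(start, stop):
    """Comma-separated zero-based indices from start to stop, inclusive."""
    return ",".join(str(i) for i in range(start, stop + 1))


target = index_list(0, 11)
background = index_list(12, 23)
# Prints the -t and -b arguments in the form shown above.
print("-t", target, "-b", background)
```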

Would you be able to post an example of your input and command line? I am trying to run pFst but am getting the error: more sample fields than samples listed in header
