Question: F(St) Test? Correlation Between Genetic Distance And Geographical Distance Between Isolates?
6.3 years ago, Adrian Pelin wrote:

I have a matrix in which the rows are isolates and the columns are nucleotides at selected sites where homozygous variation has been detected. Is there a way to do an Fst test on it? I can export this matrix into R. I have never done an Fst test before.

Thank you!

Update: My data consists of 6 isolates, and for every isolate I have a VCF file listing its variants relative to the reference genome. A record looks something like this:

Sample1    8139885    .    A    G    591.03    .    AB=0.342857;ABP=18.0245;AC=1;AF=0.25;AN=4;AO=24;CIGAR=1X;DP=70;DPB=3323;DPRA=0;EPP=46.8017;EPPR=94.401;GTI=0;HWE=-0;LEN=1;MEANALT=1;MQM=255;MQMR=255;NS=1;NUMALT=1;ODDS=3.62626;PAIRED=1;PAIREDR=1;PAO=6.95324e-310;PQA=0;PQR=0;PRO=6.95324e-310;QA=920;QR=1770;RO=46;RPP=46.8017;RPPR=94.401;RUN=1;SAP=55.1256;SRP=102.898;TYPE=snp;XAI=0.00803798;XAM=0.0305247;XAS=0.0224867;XRI=0.00860706;XRM=0.0107998;XRS=0.00219274;technology.illumina=1;BVAR    GT:DP:RO:QR:AO:QA    0/0/0/1:70:46:1770:24:920

This corresponds to one position where a variant has been found. The 6 files have a list of variants present in them, compared to the reference genome. As you can see, it tells me that A and G at that location are present in about a 2 to 1 ratio, since there are 46 observations for A and 24 for G, and the algorithm approximates the Frequency of the alternate allele G to be 0.25.
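
A minimal Python sketch of how the numbers in such a record can be pulled out and compared. The helper names `parse_vcf_line` and `allele_summary` are hypothetical, and the INFO column is shortened to a few of the fields quoted above. It suggests that AF=0.25 follows from the called genotype 0/0/0/1 (AC/AN = 1/4), while the raw read fraction for G is 24/70, which is the value reported as AB.

```python
# Illustrative only: helper names are hypothetical, and INFO is a shortened
# subset of the fields from the record quoted in the post.
record = "\t".join([
    "Sample1", "8139885", ".", "A", "G", "591.03", ".",
    "AB=0.342857;AC=1;AF=0.25;AN=4;AO=24;DP=70;RO=46;TYPE=snp;BVAR",
    "GT:DP:RO:QR:AO:QA",
    "0/0/0/1:70:46:1770:24:920",
])


def parse_vcf_line(line):
    """Split one single-sample VCF data line into INFO and sample dictionaries."""
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, _filter, info, fmt, sample = fields[:10]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flags such as BVAR carry no value; store them as True.
        info_dict[key] = value if sep else True
    sample_dict = dict(zip(fmt.split(":"), sample.split(":")))
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "info": info_dict, "sample": sample_dict}


def allele_summary(rec):
    """Compare the ALT fraction in the reads with the ALT fraction in the genotype."""
    ro = int(rec["sample"]["RO"])  # reads supporting the reference allele
    ao = int(rec["sample"]["AO"])  # reads supporting the alternate allele
    gt = rec["sample"]["GT"].replace("|", "/").split("/")
    return {
        "read_fraction_alt": ao / (ro + ao),
        "genotype_fraction_alt": gt.count("1") / len(gt),
        "ploidy": len(gt),
    }


rec = parse_vcf_line(record)
summary = allele_summary(rec)
print(summary)
```

The two fractions answer different questions: the first describes the reads from the pooled sample, the second describes the single genotype the caller assigned to it. Neither one gives per-individual genotype counts.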

That being said, this is an observation for the entire population being sequenced by NGS. If the organism is tetraploid, my conclusion is that all individuals have the G allele in one out of 4 chromatids, and A in 3 out of 4 chromatids. There is not much more I can say here, is there? I do not know how many individuals are heterozygous A/G, homozygous A, homozygous G and so on. I just know the frequency of allele A and the frequency of allele G.

Tags: R, snps, fst

The worked example I always tend to explain is this one, which shows that to compute Fst you first need to compute expected and observed heterozygosities. But without a few lines of your input file it is hard to suggest anything.

Comment written 6.3 years ago by Jorge Amigo

I looked through your example. In my population, I do not have information on genotype frequencies (i.e. how many AA, how many Aa, and how many aa). I only know which loci are heterozygous in one population, and then I compare those to the heterozygous loci in another population.

I have 6 isolates, but the information that I am able to extract from NGS of the entire population is limited. If I see a 50/50 ratio of two different bases, A and G, at one locus in a population, I assume that all individuals are heterozygous, with A on one chromatid and G on the second.

How do I proceed from this?

Comment written 6.3 years ago by Adrian Pelin

Fst can be understood as a measure of how the heterozygosity of a particular marker behaves in different populations, obtained by comparing expected and observed values. For that reason you need to work with raw genotypes, and again it's not clear what your raw data looks like. As a side note, be aware that NGS is a technique that favours detection of homozygous over heterozygous sites, which may affect the Fst evaluation.
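
To make the heterozygosity comparison concrete, here is a minimal Python sketch for one biallelic locus in a diploid organism. It assumes genotype counts per subpopulation are available and uses the textbook form F_ST = (H_T - H_S) / H_T. The function names and the counts are made up for illustration, and no sample-size correction is applied.

```python
def allele_freq(n_AA, n_Aa, n_aa):
    """Frequency of allele A from diploid genotype counts."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n)


def fst_from_genotypes(pops):
    """pops: list of (n_AA, n_Aa, n_aa) tuples, one per subpopulation."""
    sizes = [sum(p) for p in pops]
    total = sum(sizes)
    freqs = [allele_freq(*p) for p in pops]

    # Observed heterozygosity: share of heterozygous individuals overall.
    h_obs = sum(p[1] for p in pops) / total
    # H_S: size-weighted mean of the expected heterozygosity 2pq within subpopulations.
    h_s = sum(n * 2 * p * (1 - p) for n, p in zip(sizes, freqs)) / total
    # H_T: expected heterozygosity computed from the pooled allele frequency.
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / total
    h_t = 2 * p_bar * (1 - p_bar)

    f_st = (h_t - h_s) / h_t if h_t > 0 else 0.0
    return {"H_obs": h_obs, "H_S": h_s, "H_T": h_t, "F_ST": f_st}


# Made-up counts: two subpopulations of 100 individuals each.
diverged = fst_from_genotypes([(40, 40, 20), (10, 20, 70)])
identical = fst_from_genotypes([(25, 50, 25), (25, 50, 25)])
print(diverged)
print(identical)
```

Note that `allele_freq(25, 50, 25)` and `allele_freq(50, 0, 50)` both return 0.5 although the observed heterozygosity is 0.5 in one case and 0 in the other, which is why an allele frequency alone does not recover the genotype counts.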

Comment written 6.3 years ago by Jorge Amigo

Sorry, I guess I should have given an example of my data sooner; I have now posted how my data is organized. On another note, could you please let me know where I can find more information about the bias NGS introduces towards homozygous variants? Thank you for your help.

Comment written 6.3 years ago by Adrian Pelin
6.3 years ago, Jorge Amigo (Santiago de Compostela, Spain) wrote:

I see that your organism is tetraploid, but everything I know about Fst relates to diploid organisms. I regret that I really don't know how ploidy affects Fst measurements. In any case, you are treating raw NGS data as if it were a direct method, and it isn't. If you have a pool of samples, some may have been enriched more than others, and for that reason you can't take the allele counts as independent values. The only way I know to combine NGS and Fst calculation is to use a diploid organism (human in our case), sequence individual samples, and treat the results as if they were raw genotyping results. If there is a way of using pooled NGS data for real and trustworthy population statistics that can deal with the enrichment bias, I'm sorry, I'm not aware of it.

Regarding the NGS bias mentioned in your question's comments, it's not really a problem of NGS but of any methodology based on PCR, which may produce imbalanced amplification. This can indeed amplify one allele more than the other, and it could lead to calling a homozygous variant when the site is really heterozygous. A good reference for this may be this one, which in fact mentions this other one.
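
A toy Python illustration of how imbalanced amplification could turn a heterozygous site into a homozygous call. The per-cycle efficiencies, the cycle count and the 20% calling threshold are hypothetical values chosen for the example. Real variant callers use likelihood models rather than a fixed cutoff, so this is only a sketch of the mechanism.

```python
def alt_fraction_after_pcr(cycles, eff_ref, eff_alt, start_alt_fraction=0.5):
    """Expected ALT fraction after `cycles` rounds of amplification.

    Each allele is copied with its own per-cycle efficiency (0..1), so its
    copy number grows by a factor of (1 + efficiency) per cycle.
    """
    ref = (1 - start_alt_fraction) * (1 + eff_ref) ** cycles
    alt = start_alt_fraction * (1 + eff_alt) ** cycles
    return alt / (ref + alt)


def naive_call(alt_fraction, min_minor_fraction=0.2):
    """Hypothetical fixed-threshold genotype call for a diploid site."""
    if alt_fraction < min_minor_fraction:
        return "hom-ref"
    if alt_fraction > 1 - min_minor_fraction:
        return "hom-alt"
    return "het"


# A truly heterozygous site (50/50 template) under two amplification regimes.
balanced = alt_fraction_after_pcr(cycles=30, eff_ref=0.90, eff_alt=0.90)
biased = alt_fraction_after_pcr(cycles=30, eff_ref=0.95, eff_alt=0.80)
print(balanced, naive_call(balanced))
print(biased, naive_call(biased))
```

Under these assumed efficiencies the balanced case stays at 50% ALT and is called heterozygous, while the biased case drops to roughly 8% ALT and falls below the cutoff.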


Thank you so much for your insight. I will read the papers and try to make sense of what I am supposed to do. One last question: what could it mean when you have 6 very distant populations, spread across the globe, with essentially no genetic distance between them? That would be something like a value of 0 in an Fst test. Let's say this happened between human populations. What could we conclude about these populations and their mode of propagation?

If these populations were clonal, we would expect intra-population mutations to occur at a high rate, and we would expect to see genetic distance correlate with geographical distance?

If these organisms are sexual, then sex between individuals of different populations would screw up our Fst analysis, since variation accumulated in one population due to inbreeding would be exchanged with other populations, and genetic distance would end up being minimal regardless of geographical distance?

Comment written 6.3 years ago by Adrian Pelin

This is definitely not a question to post here, but I can't help pointing out some basic concepts about Fst. Keep in mind that the Fst entry on Wikipedia does quite a good job, being very concise and providing a nice bunch of good references for further reading.

  1. An Fst of zero means no noticeable divergence, and higher Fst values imply larger genetic distances.
  2. Genetic distance and geographic distance may be related to some degree, but evolutionary forces (isolation, bottlenecks, migration, ...) may be more important. Indeed, if 2 populations are isolated, their Fst will increase over time, but if they mix, the allele frequency differences will be diluted and their Fst will decrease.
Comment written 6.3 years ago by Jorge Amigo
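
The dilution effect in point 2 can be sketched with a deliberately simple Python model: two equally sized populations, one biallelic locus, symmetric migration at rate `m` per generation, and no drift, mutation or selection. The function names and starting frequencies are hypothetical. Because drift is left out, the isolated case (`m = 0`) stays constant here instead of rising, so only the mixing half of the statement is demonstrated.

```python
def fst_two_pops(p1, p2):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus, two equal-sized populations."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def fst_under_migration(p1, p2, m, generations):
    """Track F_ST while a fraction m of each population is swapped every generation."""
    trajectory = [fst_two_pops(p1, p2)]
    for _ in range(generations):
        # Each population keeps (1 - m) of its own alleles and receives m from the other.
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
        trajectory.append(fst_two_pops(p1, p2))
    return trajectory


# Hypothetical starting allele frequencies of 0.9 and 0.1.
mixing = fst_under_migration(0.9, 0.1, m=0.1, generations=20)
isolated = fst_under_migration(0.9, 0.1, m=0.0, generations=20)
print([round(x, 4) for x in mixing[:5]])
print(round(isolated[-1], 4))
```

With these starting values F_ST begins at 0.64 and shrinks every generation once migrants are exchanged, which is one way populations spread across the globe could still show an F_ST near zero.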