F(St) Test? Correlation Between Genetic Distance And Geographical Distance Between Isolates?
10.5 years ago
Adrian Pelin ★ 2.6k

I have a matrix in which the rows are isolates and the columns are nucleotides at selected sites where homozygous variation has been detected. Is there a way to do an Fst test? I can export this matrix into R. I have never done an Fst analysis before.

Thank you!

Update: My data consists of 6 isolates, and for every isolate I have a VCF file listing variants relative to the reference genome. So it looks something like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT whatever
Sample1    8139885    .    A    G    591.03    .    AB=0.342857;ABP=18.0245;AC=1;AF=0.25;AN=4;AO=24;CIGAR=1X;DP=70;DPB=3323;DPRA=0;EPP=46.8017;EPPR=94.401;GTI=0;HWE=-0;LEN=1;MEANALT=1;MQM=255;MQMR=255;NS=1;NUMALT=1;ODDS=3.62626;PAIRED=1;PAIREDR=1;PAO=6.95324e-310;PQA=0;PQR=0;PRO=6.95324e-310;QA=920;QR=1770;RO=46;RPP=46.8017;RPPR=94.401;RUN=1;SAP=55.1256;SRP=102.898;TYPE=snp;XAI=0.00803798;XAM=0.0305247;XAS=0.0224867;XRI=0.00860706;XRM=0.0107998;XRS=0.00219274;technology.illumina=1;BVAR    GT:DP:RO:QR:AO:QA    0/0/0/1:70:46:1770:24:920

This corresponds to one position where a variant has been found. Each of the 6 files lists the variants present in that isolate, compared to the reference genome. As you can see, it tells me that A and G at that location are present in about a 2 to 1 ratio, since there are 46 observations for A and 24 for G, and the algorithm estimates the frequency of the alternate allele G to be 0.25.
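To make the two numbers in that record concrete, here is a minimal Python sketch that pulls them out. It assumes the INFO keys shown above (`RO`, `AO`, `AF`) and the `GT` value `0/0/0/1`; the INFO string is abbreviated from the record, and the helper name `parse_info` is made up for illustration. It shows that the read-based alternate fraction (AO over RO plus AO) and the genotype-based frequency (alternate copies in GT over ploidy) are different quantities.

```python
def parse_info(info):
    """Turn a VCF INFO string ('K=V;K=V;FLAG') into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Abbreviated INFO string and GT value copied from the record above.
info = "AB=0.342857;AC=1;AF=0.25;AN=4;AO=24;DP=70;RO=46;TYPE=snp"
genotype = "0/0/0/1"

fields = parse_info(info)
ref_obs = int(fields["RO"])  # reads supporting the reference allele (A)
alt_obs = int(fields["AO"])  # reads supporting the alternate allele (G)

# Fraction of reads carrying the alternate allele.
read_alt_fraction = alt_obs / (ref_obs + alt_obs)

# Fraction of called allele copies that are alternate (tetraploid call here).
alleles = genotype.split("/")
genotype_alt_fraction = sum(a != "0" for a in alleles) / len(alleles)

print(read_alt_fraction, genotype_alt_fraction, float(fields["AF"]))
```

The read fraction comes out at 24/70, which matches the `AB` value in the record, while the genotype-based fraction matches `AF`.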

That being said, this is an observation for the entire population being sequenced by NGS. If the organism is tetraploid, my conclusion is that all individuals have the G allele in one out of 4 chromatids, and A in 3 out of 4 chromatids. There is not much more I can say here, is there? I do not know how many are heterozygous A/G, homozygous A, homozygous G and so on. I just know the frequency of allele A and the frequency of allele G.

fst r snps • 3.9k views

The worked example I usually point to is this one, which shows that to compute Fst you first need to compute expected and observed heterozygosities. Without a few lines of your input file, though, it is hard to suggest anything specific.


I looked through your example. In my population, I do not have information on genotype frequencies (i.e. how many AA, how many Aa, and how many aa). I only have information on which loci are heterozygous in one population, and then I compare that to the heterozygous loci in another population.

I have 6 isolates, but the information that I am able to extract, from NGS of the entire population, is limited. If I see in one population a 50/50 ratio at one locus for 2 different basepairs A and G, I assume that all individuals are heterozygous, with A in one chromatid, and G in the second chromatid.

How do I proceed from this?


Fst can be understood as a measure of how the heterozygosity of a particular marker behaves across populations, obtained by comparing expected and observed values. For that reason you need to work with raw genotypes, and again it's not clear what your raw data looks like. As a side note, be aware that NGS tends to favour detection of homozygous over heterozygous sites, which may affect the Fst estimate.
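For orientation, one common textbook formulation is Fst = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled allele frequency. The Python sketch below is only an illustration under strong assumptions: a single biallelic locus, equal-sized populations, and the diploid Hardy-Weinberg expectation 2p(1-p). It works from allele frequencies alone, so it does not use observed genotypes and does not address the pooling or ploidy issues raised in this thread. The function names are made up for this example.

```python
def expected_het(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus (diploid, Hardy-Weinberg)."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs):
    """Fst = (H_T - H_S) / H_T from per-population allele frequencies.

    Assumes equal population sizes, so the pooled frequency is a plain mean.
    """
    if not freqs:
        raise ValueError("need at least one population frequency")
    h_s = sum(expected_het(p) for p in freqs) / len(freqs)  # mean within-population
    p_bar = sum(freqs) / len(freqs)                         # pooled allele frequency
    h_t = expected_het(p_bar)                               # total expected heterozygosity
    if h_t == 0.0:
        return 0.0  # locus is monomorphic overall, so there is no divergence to measure
    return (h_t - h_s) / h_t


# Hypothetical alternate-allele frequencies in two populations.
print(fst_biallelic([0.5, 0.5]))  # identical frequencies
print(fst_biallelic([0.2, 0.8]))  # diverged frequencies
print(fst_biallelic([0.0, 1.0]))  # fixed for different alleles
```

Identical frequencies give 0 and populations fixed for different alleles give 1, with intermediate divergence falling in between.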


Sorry, I guess I should have given an example of my data sooner; I have now posted how my data is organized. On another note, could you please let me know where I can find more information about the bias NGS introduces towards homozygous variants? Thank you for your help.

10.5 years ago

I see that your organism is tetraploid, but everything I know about Fst relates to diploid organisms. I'm afraid I really don't know how different ploidy affects Fst measurements. In any case, you are treating raw NGS data as if it were a direct measurement, and it isn't. If you have a pool of samples, some may have been enriched more than others, so you can't treat the allele counts as independent values. The only way I know to combine NGS with Fst calculation is to use a diploid organism (human in our case), sequence individual samples, and treat the results as if they were raw genotyping results. If there is a way of using pooled NGS data for trustworthy population statistics that can deal with the enrichment bias, I'm sorry, but I'm not aware of it.

Regarding the NGS bias mentioned in your question's comments, it's not really a problem of NGS but of any methodology based on PCR, which may produce imbalanced amplification. This can amplify one allele more than the other, and it could lead to calling a homozygous variant when the site is really heterozygous. A good reference for this may be this one, which in fact mentions this other one.


Thank you so much for your insight. I will read the papers and try to make sense of what I am supposed to do. One last question: what could it mean when you have 6 very distant populations, spread across the globe, with essentially no genetic distance between them? That would correspond to an Fst of roughly 0. Let's say this happened between human populations. What could we conclude about these populations and their mode of propagation?

If these populations were clonal, we would expect intra-population mutations to occur at a high rate, and we would expect genetic distance to correlate with geographical distance?

If these organisms are sexual, then sex among individuals of different populations would confound our Fst analysis, since variation accumulated in one population due to inbreeding would be exchanged with other populations, and genetic distance would end up being minimal regardless of geographical distance?


This is definitely not a question to post here, but I can't help pointing out some basic concepts about Fst. Keep in mind that the Fst entry on Wikipedia does quite a good job: it is concise and provides a good set of references for further reading.

  1. An Fst of zero means no noticeable divergence, and higher Fst values imply greater genetic distance.
  2. Genetic distance and geographic distance may be related, but evolutionary forces (isolation, bottlenecks, migration, ...) may matter more. Indeed, if two populations are isolated their Fst will increase over time, but if they mix their allele frequencies will converge, so their Fst will decrease.
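The mixing half of point 2 can be illustrated with a small deterministic Python sketch. The assumptions are loud ones: two equal-sized populations, one biallelic locus, symmetric migration at a made-up rate `m` per generation, no drift, no mutation and no selection, and the same (H_T - H_S) / H_T formulation of Fst with the diploid expectation 2p(1-p). Because drift is left out, the sketch cannot show Fst rising under isolation; with `m = 0` it simply stays constant. All names and starting values are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Fst = (H_T - H_S) / H_T for two equal-sized populations at a biallelic locus."""
    het = lambda p: 2.0 * p * (1.0 - p)       # expected heterozygosity
    h_s = (het(p1) + het(p2)) / 2.0           # mean within-population value
    h_t = het((p1 + p2) / 2.0)                # value for the pooled frequency
    return 0.0 if h_t == 0.0 else (h_t - h_s) / h_t


def fst_under_migration(p1, p2, m, generations):
    """Track Fst while a fraction m of each population is replaced by migrants each generation."""
    trajectory = [fst_two_pops(p1, p2)]
    for _ in range(generations):
        # Simultaneous update: both new frequencies use the previous generation's values.
        p1, p2 = (1.0 - m) * p1 + m * p2, (1.0 - m) * p2 + m * p1
        trajectory.append(fst_two_pops(p1, p2))
    return trajectory


# Hypothetical starting frequencies and migration rate.
mixing = fst_under_migration(0.1, 0.9, m=0.05, generations=20)
isolated = fst_under_migration(0.1, 0.9, m=0.0, generations=20)
print(mixing[0], mixing[-1])      # Fst shrinks as the populations exchange alleles
print(isolated[0], isolated[-1])  # no exchange and no drift: Fst is unchanged
```

A low Fst between geographically distant isolates is therefore what ongoing exchange would produce, although it is not the only possible explanation.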