Question: PLINK heterozygosity - Negative F statistic?
Caragh wrote, 14 months ago:

Hi there,

I am running QC on GWAS data from 48 samples that we are planning to use for linkage analysis. I have run the --het command in PLINK to check for excess heterozygosity and/or consanguinity.

For every sample I am getting a negative F statistic, indicating more heterozygosity than expected. But I'm wondering: how much heterozygosity is too much heterozygosity? Is there a threshold that would indicate sample contamination, or is it solely based on the distribution amongst the 48 samples?

Here is an example of the --het results file for a subset of the samples (columns: sample ID, O(HOM), E(HOM), N(NM), F):

1   292543  295900  388776  -0.03604
2   299349  302200  396946  -0.03016
3   298893  302000  396663  -0.03272
4   299188  302600  397491  -0.03591
5   298827  302200  396937  -0.0354
6   274894  283200  372565  -0.09318
7   298750  302500  397353  -0.03951
8   298737  302300  397082  -0.03761
9   299640  302600  397511  -0.03138

Any help would be greatly appreciated!



RamRS (Houston, TX) wrote, 14 months ago:

AFAIK these are negligible negatives. Unless I'm mistaken, strongly negative values suggest sample contamination rather than consanguinity.


Thanks for your help Ram!

— Caragh, 13 months ago

According to the PLINK documentation:

"The estimate of F can sometimes be negative. Often this will just reflect random sampling error, but a result that is strongly negative (i.e. an individual has fewer homozygotes than one would expect by chance at the genome-wide level) can reflect other factors, e.g. sample contamination events perhaps."
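For concreteness, the F coefficient in the .het output works out to (O(HOM) − E(HOM)) / (N(NM) − E(HOM)). A quick sanity check against the first row of the output posted above (note that E(HOM) is rounded in the display, so the result only approximately matches the reported value):

```python
# Values taken from row 1 of the .het subset in the question:
# O(HOM) = observed homozygous genotypes, E(HOM) = expected homozygous
# genotypes, N(NM) = non-missing genotypes.
o_hom, e_hom, n_nm = 292543, 295900, 388776

# Method-of-moments estimate of F: excess/deficit of homozygotes
# relative to expectation, scaled by the expected heterozygote count.
f = (o_hom - e_hom) / (n_nm - e_hom)
print(round(f, 4))  # prints -0.0361, close to the reported -0.03604
```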

Ram, when you said these negative values are "negligible", I assume you meant that those individuals would not need to be removed.

However, I feel it would be helpful if you could provide a more objective definition of "negligible" (i.e., actual numbers). At what value do these negative scores stop being "negligible" (e.g., -0.5, -1.8, something else) and warrant removal?

— moldach, 10 months ago

That's a wonderful question, and unfortunately I do not have a strong rationale for a definite threshold. Part of it came from what we saw in the majority of the samples we sequenced, and part of it was simply practice carried over. We used a threshold of around -0.2. Heavily contaminated samples would drop out at other stages of the pipeline anyway (sample prep, sequencing, and other QC metrics applied before the F-statistic step), so by the time we measured F we would rarely see values cross -0.25. For example, we would first filter down to a high-quality set of variants that contributed to the F statistic (and similar statistics), so any underlying condition affecting a sample would show up strongly in what we saw.
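A minimal sketch of that kind of cutoff, assuming whitespace-delimited .het-style rows like the subset posted in the question (sample ID first, F in the last column); the -0.2 threshold is the rough value mentioned above, and `flag_het_outliers` is a hypothetical helper, not code from our actual pipeline:

```python
def flag_het_outliers(rows, threshold=-0.2):
    """Return sample IDs whose F statistic falls below the threshold.

    rows: iterable of whitespace-delimited strings shaped like the
    .het subset in the question (ID ... F in the last column).
    """
    flagged = []
    for line in rows:
        fields = line.split()
        sample_id, f_stat = fields[0], float(fields[-1])
        if f_stat < threshold:
            flagged.append(sample_id)
    return flagged

# Two of the nine samples posted in the question; neither crosses -0.2,
# so nothing is flagged at the default threshold.
rows = [
    "1 292543 295900 388776 -0.03604",
    "6 274894 283200 372565 -0.09318",
]
print(flag_het_outliers(rows))                   # prints []
print(flag_het_outliers(rows, threshold=-0.05))  # prints ['6']
```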

All said and done, it was still subjective to an extent in that it worked for our consortium.

— RamRS, 10 months ago

Thank you for that detailed response.

It makes sense that those samples suffering from contamination are dropping out in other (upstream) steps of the QC pipeline.

Furthermore, many things we do in bioinformatics analysis are subjective in the sense that they are practices carried over (within a lab/consortium/sub-field) rather than objective, benchmarked standards, but they just work.

— moldach, 10 months ago


Powered by Biostar version 2.3.0