Question: Comparing SNPs Across Populations
10.0 years ago by
Andrea_Bio (2.6k) wrote:


I have a very high-level exploratory question about SNPs and comparative genomics.

Let's say I had 2 different populations of the same species, one resistant to a particular disease and the other susceptible. Naturally I want to find out what confers resistance on the healthy population. How would I go about this using SNP data? I appreciate this is a huge question, but I'm just trying to find out which areas I need to research in more detail.

Let's say I had a full genome sequence for individuals from both populations and knew their SNP alleles. Can the allele frequencies of SNPs in the 2 populations tell me anything? If the allele frequencies of a SNP differ between the 2 populations, is that potentially interesting (although it might reflect some difference between the 2 populations other than disease susceptibility)? How many individuals would I need in each population to compare the allele frequencies?

Are there any sorts of statistical analysis I might perform? For example if I found an area of the genome had a higher/lower SNP distribution than the rest of the genome does this tell me anything? For example does a lower SNP distribution mean it is conserved and subject to positive selection? It's been a long time since I studied this so I could be remembering this all wrong.

Many thanks

comparative snp statistics • 7.1k views
modified 9.9 years ago by David Quigley (11k) • written 10.0 years ago by Andrea_Bio (2.6k)
10.0 years ago by
Haibao Tang (3.0k), Mountain View, CA, wrote:

This is a common problem in association genetics. In theory, comparing SNP frequencies could work, but you will be swamped by false positives. The cause is population structure.

The ideal case is that the two populations differ in the responsible SNP and nothing else (like mutant and wild type); however, that is not the general case. You'll most likely get at least tens of thousands of SNPs with frequency differences and have no idea which one is involved in disease susceptibility. The more divergent your populations are, the harder it gets: say you want to find which SNP lets humans speak while chimps cannot; you'll end up with millions of candidate SNPs that differ in frequency between the two species.

Do you have more information about where your candidate might be? That could narrow the search enough to make your method feasible.

To your last question: based on coalescent theory, a region with an unusually low SNP rate might result from a selective sweep, which might indicate a site undergoing positive selection.

modified 10.0 years ago • written 10.0 years ago by Haibao Tang (3.0k)
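
The idea of scanning for regions with an unusually low SNP rate can be sketched roughly as follows. This is only an illustration, not something from the thread: the SNP positions are made up, the function names are hypothetical, and the window size and the "below 20% of the mean" cutoff are arbitrary assumptions. A low count can also come from purifying selection, a low local mutation rate, or poor sequencing coverage, so a flagged window is a candidate for follow-up rather than evidence of a sweep.

```python
from collections import Counter


def snp_counts_per_window(positions, chrom_length, window):
    """Count SNPs in consecutive non-overlapping windows of `window` bp.

    `positions` are 1-based coordinates on a single chromosome.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = Counter((pos - 1) // window for pos in positions)
    return [counts.get(i, 0) for i in range(n_windows)]


def low_density_windows(counts, fraction=0.2):
    """Return indices of windows whose SNP count is below `fraction` of the
    mean count. The cutoff is an arbitrary assumption for illustration."""
    mean = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if c < fraction * mean]


# Made-up data: one SNP every 100 bp, except a SNP-free stretch at 3001-4000.
positions = [p for p in range(50, 10_000, 100) if not 3000 < p <= 4000]
counts = snp_counts_per_window(positions, chrom_length=10_000, window=1_000)
print(counts)
print(low_density_windows(counts))
```

In practice one would probably use an established diversity statistic computed per window rather than raw counts, but the windowing logic is the same.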

Thanks for your answer. Let's say we had the ideal case and there was one SNP with a frequency difference; what frequency difference would you expect to see? For example, if the susceptible population had a minor allele frequency (MAF) of 5% (major allele frequency 95%) and the tolerant population had a MAF of, say, 20%, could you say that the minor allele is perhaps conferring resistance? Any other basic examples are welcome. I'm just trying to get a basic 'feel'.

written 9.9 years ago by Andrea_Bio (2.6k)

Can you estimate the prevalence of disease resistance in either population, and also the penetrance of the mutation? Knowing these two might help to predict the difference in MAF.

written 9.9 years ago by Haibao Tang (3.0k)
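
To get a rough feel for the 5% versus 20% MAF example above, one option is Fisher's exact test on the allele counts. The sketch below is an illustration only, not a method proposed in the thread. It assumes diploid individuals (2N alleles per population), equal sample sizes, and observed frequencies that match the stated values exactly; the function names are hypothetical. It also ignores population structure, so a small p-value says only that the frequencies differ, not that the minor allele confers resistance.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are populations; columns are (minor, major) allele counts.
    Sums the probabilities of every table with the same margins that is
    no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x minor alleles in population 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def p_value_for_sample_size(n_individuals, maf_a=0.05, maf_b=0.20):
    """P-value when each population has n diploid individuals and the
    observed minor allele frequencies are exactly maf_a and maf_b."""
    alleles = 2 * n_individuals
    minor_a = round(alleles * maf_a)
    minor_b = round(alleles * maf_b)
    return fisher_exact_two_sided(minor_a, alleles - minor_a,
                                  minor_b, alleles - minor_b)


for n in (10, 20, 50, 100):
    print(n, p_value_for_sample_size(n))
```

Running this for a range of N shows how the same 5% versus 20% difference goes from unremarkable to clearly non-random as the sample grows. In a genome-wide scan the threshold would also have to be corrected for the number of SNPs tested.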
10.0 years ago by
Mrawlins (420) wrote:

The SNPs with the largest differences in distribution between the two populations are ideal targets for further study. Ideally you would find a single SNP that is 100% one way in one phenotype and 100% the other way in the other phenotype. That isn't particularly likely in most cases, though. Often, even for phenotypes associated with a single SNP, you don't get 100% identification due to random effects and noise in the data (mis-called SNPs, etc.).

The basic idea behind this type of analysis is that it's a classification problem (predicting which phenotype based on SNP). Any classifier (Naive Bayes, Decision Tree, Artificial Neural Network, etc.) could be of use here. Techniques like principal components analysis could help eliminate some SNPs off the bat and make subsequent analysis easier.

You want to look at the sensitivity and specificity of your classification, and maximize both with the fewest number of SNPs. Pretty much any SNP identified this way needs to be verified with further experimentation before concluding "SNP X causes phenotype Y".

written 10.0 years ago by Mrawlins (420)
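
The sensitivity and specificity mentioned above can be computed per SNP by treating "carries the candidate allele" as a prediction of resistance. The following is a minimal sketch under that assumption; the genotypes and phenotypes are made-up toy data, and the names `snp_a`, `snp_b`, and `sensitivity_specificity` are hypothetical. Ranking by sensitivity + specificity - 1 (Youden's J) is one simple choice, not the only one.

```python
def sensitivity_specificity(carries_allele, is_resistant):
    """Score 'carries the candidate allele' as a predictor of resistance.

    Both arguments are equal-length lists of booleans, one entry per
    individual. Returns (sensitivity, specificity).
    """
    pairs = list(zip(carries_allele, is_resistant))
    tp = sum(1 for g, p in pairs if g and p)          # carrier, resistant
    fn = sum(1 for g, p in pairs if not g and p)      # non-carrier, resistant
    tn = sum(1 for g, p in pairs if not g and not p)  # non-carrier, susceptible
    fp = sum(1 for g, p in pairs if g and not p)      # carrier, susceptible
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Made-up toy data: 5 resistant individuals followed by 5 susceptible ones.
is_resistant = [True] * 5 + [False] * 5
snps = {
    "snp_a": [True, True, True, True, False,
              False, False, True, False, False],
    "snp_b": [True, False, True, False, True,
              False, True, False, True, False],
}

scores = {name: sensitivity_specificity(g, is_resistant)
          for name, g in snps.items()}
# Rank by Youden's J = sensitivity + specificity - 1.
best = max(scores, key=lambda name: sum(scores[name]) - 1)
print(scores)
print(best)
```

With thousands of SNPs and few individuals, some SNPs will score well by chance alone, which is why any SNP found this way needs independent verification.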

Every individual counts as a single data point. The more data you have, the more reliable the results are. The sensitivity and specificity are only as accurate as 1/N at most. Assume you have a 200/N % chance of missing a useful SNP. What's the smallest N you can live with for your experiment? That's where I start with this sort of analysis. A sample size of 20-50 is probably sufficient if you corroborate your results with additional experiments (e.g. inducing disease state using controlled mutations, etc.).

written 9.9 years ago by Mrawlins (420)

Thanks for your answer. Do you know how many individuals I would need in each population to draw any meaningful conclusions?

written 10.0 years ago by Andrea_Bio (2.6k)
9.9 years ago by
Larry_Parnell (16k), Boston, MA USA, wrote:

Because there could be thousands of small (1 to 100 bp) genetic differences (polymorphisms, insertions, deletions) between your resistant (R) and susceptible (S) strains, it may be necessary to reduce some of that genetic difference by back-crossing. Ugh, long time to see results...

That said, I would consider gene expression differences between the 2 strains as a second source of data of genes affected by those genetic differences. This is exactly what we did for a situation in mouse identical to what you describe. The important findings were comparing the uninfected states of the R and S strains as we learned how R was better primed to handle the challenge.

If you cannot get the SNPs involved (may be long, hard work, lots of mating and sequencing), at least the gene expression data gives you something to report. And these genes are legitimate targets for further research.

modified 9.9 years ago • written 9.9 years ago by Larry_Parnell (16k)
9.9 years ago by
David Quigley (11k), San Francisco, wrote:

A short answer is that possessing the complete sequence is helpful but not sufficient.

The classical genetics approach is to cross the two strains together (backcross, intercross, etc.) and map the phenotype (susceptibility) to a locus. You then try to refine the locus using congenic strains or other genetic techniques. A classic text on this is Silver. One modern approach (used by many groups including ours) is to use gene expression data to refine the phenotype; see the work of Robert Williams' group, or our own papers (Balmain lab) for examples. This is still a very, very hard problem.

If all you have is sequence data, one thing that hasn't been mentioned yet is that polymorphisms that change the protein sequence of a coding exon are better de novo candidates than polymorphisms in non-coding DNA. That's not at all to say that causal polymorphisms must be in exons, but if you have no idea where to look, that's a good place to start.

written 9.9 years ago by David Quigley (11k)
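
Prioritising polymorphisms that change the protein sequence can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a SNP given as a 0-based position in an in-frame coding sequence on the coding strand; the function name and the example sequence are hypothetical. Real annotation has to deal with strand, splicing, and multiple transcripts, which is what dedicated variant annotation tools are for.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snp(cds, position, alt_base):
    """Classify a SNP within an in-frame coding sequence.

    cds:      coding sequence (coding strand), starting at the first base
              of the first codon
    position: 0-based index of the SNP within `cds`
    alt_base: the alternative base at that position
    """
    cds = cds.upper()
    start = position - position % 3
    ref_codon = cds[start:start + 3]
    if len(ref_codon) < 3:
        raise ValueError("SNP falls in an incomplete final codon")
    offset = position % 3
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"


cds = "ATGGAAGTT"  # made-up sequence: Met-Glu-Val
print(classify_coding_snp(cds, 5, "G"))  # GAA -> GAG
print(classify_coding_snp(cds, 4, "T"))  # GAA -> GTA
print(classify_coding_snp(cds, 3, "T"))  # GAA -> TAA
```

Missense and nonsense variants that also differ in frequency between the two populations would be natural first candidates, keeping in mind the caveat above that causal variants need not be in exons.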