Question: SNP enrichment, how to test for it?
3
gravatar for Adrian Pelin
4.4 years ago by
Adrian Pelin2.2k
Canada
Adrian Pelin2.2k wrote:

Hello,

I am comparing a couple of strains of a parasitic fungus, and the strains overall seem pretty similar, in a 3mb genome there are about 10k homozygous SNPs between these isolates. However, 2 isolates of interest share 400 of these SNPs.

I am interested in testing for enrichment, in other words, genes affected by these 400 SNPs, are they different then genes affected by the overall 10k SNPs?

I know how to do this in the simplest way when I have just a list_1 of genes and a subset list_2. However, here we do not have just lists, we also have amount of SNPs per gene. There are genes affected by 10 SNPs and some affected only by 1, surely this needs to be taken into account.

I usually do this using KOG/GO terms or KEGG pathways.

Can anyone propose an approach that would seem reasonable?

Thank you,

Adrian

kog go snps enrichment • 3.2k views
ADD COMMENTlink modified 4.4 years ago by mikhail.shugay3.3k • written 4.4 years ago by Adrian Pelin2.2k

Could you briefly explain which approach finally worked for you?

ADD REPLYlink written 2.0 years ago by LNA30
2
gravatar for Cytosine
4.4 years ago by
Cytosine440
Ljubljana, Slovenia
Cytosine440 wrote:

Hey, this sounds like a very interesting problem :)

Statistics not being my strong topic, I would go for a very straightforward approach.

Taking your proposed genome size of 3Mb and the 10000 homozygous SNPs into account, I would calculate first the expected average SNP density over this genome (not really realistic, but it's a start). 3Mbp/10k = 300bp

This would mean you would on average expect to encounter a SNP in your genome every 300 bp.

Now I would consider your genes that contain the 400 SNPs and calculate their (length divided with the expected SNP density) and exclude those that have this value higher than the actual number of SNPs. For example a 400bp gene with 1 snp would be excluded but a 500bp gene with 2 snps would be included. In this way you would get a set of genes that have a higher than average genomic snp density or "snp-enriched genes". Then I would use them as the subset list_2 in your enrichment analysis.

If you wanted to take into account local snp hot-spots and differing variant distributions throughout the genome, then you would need to fit a model to the data and test enrichment according to that model to get "snp-enriched genes" for your specific genome.

 

ADD COMMENTlink written 4.4 years ago by Cytosine440

Interesting approach. For the homozygous SNPs, in order to determine the average SNP density in coding regions, is it better to take the SNPs that are within genes, and divide that amount of SNPs by the total length of all genes in the genome?

Also, it's clear what I use for list_2, but what's my list_1? Is it genes affected by SNPs, all genes, or something else?

ADD REPLYlink written 4.4 years ago by Adrian Pelin2.2k

In theory, it should not have any effect as the original assumption is that snps are equally distributed throughout the genome (no matter what subsection you take it will have the same distribution). But realistically, I guess you'll get more snp-enriched genes, since the coding regions could potentially be less prone to variants and thus lower the average snp density estimate.

Also, I guess it wouldn't matter much for a fungal genome that has a lot of coding regions, where the estimate would be close to that of using the full genome.

In the case of what you're trying to answer (genes affected by these 400 SNPs, are they different then genes affected by the overall 10k SNPs?), I'd say list_1 should be only genes affected by snps and not the whole set of genes.

ADD REPLYlink written 4.4 years ago by Cytosine440
1
gravatar for biogirl
4.4 years ago by
biogirl160
European Union
biogirl160 wrote:

If you have a number of SNPs that you expect to see per gene, then you can calculate the z-score and use a cutoff to decide whether a gene is over- or underrepresented for SNPs.  

ADD COMMENTlink written 4.4 years ago by biogirl160

It seems this approach to be very similar with what Cytosine proposed, except involving a score, which is necessary. Any chance you have any tutorials handy or any R code?

ADD REPLYlink written 4.4 years ago by Adrian Pelin2.2k
0
gravatar for biogirl
4.4 years ago by
biogirl160
European Union
biogirl160 wrote:

I don't I'm afraid, as I wrote it in Matlab.  But I'm happy to talk you through what I did if you like!

ADD COMMENTlink written 4.4 years ago by biogirl160
0
gravatar for mikhail.shugay
4.4 years ago by
mikhail.shugay3.3k
Czech Republic, Brno, CEITEC
mikhail.shugay3.3k wrote:

I believe the first thing to do is to actually filter the list SNPs, leaving only those that are in coding region and are non-synonymous ones. This would automatically remove the bias for gene length, as it mostly comes from introns. Then I would actually go for the standard testing (gene ontology).

If a more detailed view is needed, I think the best way to do is to look at protein domains affected by non-synonymous SNPs and do an GO enrichment test for them.

Of note, usually the problem is quite inverse - sparse sets of genes are affected by mutations, and the functional enrichment of interest usually reveals itself when those sets are somehow "smoothed", see this paper for example.

From the other point of view, one could be interested in analyzing the synonymous/non-synonymous substitutions and polymorphisms (see McDonald-Kreitman test), to see which genes are under genes are under positive/negative selection and perform functional enrichment tests for them.

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by mikhail.shugay3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour