Question

SNP enrichment, how to test for it?

3

Entering edit mode

9.4 years ago

Adrian Pelin ★ 2.6k

Hello,

I am comparing a couple of strains of a parasitic fungus, and the strains overall seem pretty similar, in a 3mb genome there are about 10k homozygous SNPs between these isolates. However, 2 isolates of interest share 400 of these SNPs.

I am interested in testing for enrichment, in other words, genes affected by these 400 SNPs, are they different then genes affected by the overall 10k SNPs?

I know how to do this in the simplest way when I have just a list_1 of genes and a subset list_2. However, here we do not have just lists, we also have amount of SNPs per gene. There are genes affected by 10 SNPs and some affected only by 1, surely this needs to be taken into account.

I usually do this using KOG/GO terms or KEGG pathways.

Can anyone propose an approach that would seem reasonable?

Thank you,

Adrian

SNPs GO enrichment KOG • 5.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Could you briefly explain which approach finally worked for you?

ADD REPLY • link 7.0 years ago by LNA ▴ 30

Ram · Answer 1 · 2014-11-26

2

Entering edit mode

9.4 years ago

Cytosine ▴ 460

Hey, this sounds like a very interesting problem :)

Statistics not being my strong topic, I would go for a very straightforward approach.

Taking your proposed genome size of 3Mb and the 10000 homozygous SNPs into account, I would calculate first the expected average SNP density over this genome (not really realistic, but it's a start). 3Mbp/10k = 300bp

This would mean you would on average expect to encounter a SNP in your genome every 300 bp.

Now I would consider your genes that contain the 400 SNPs and calculate their (length divided with the expected SNP density) and exclude those that have this value higher than the actual number of SNPs. For example a 400bp gene with 1 snp would be excluded but a 500bp gene with 2 snps would be included. In this way you would get a set of genes that have a higher than average genomic snp density or "snp-enriched genes". Then I would use them as the subset list_2 in your enrichment analysis.

If you wanted to take into account local snp hot-spots and differing variant distributions throughout the genome, then you would need to fit a model to the data and test enrichment according to that model to get "snp-enriched genes" for your specific genome.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Cytosine ▴ 460

0

Entering edit mode

Interesting approach. For the homozygous SNPs, in order to determine the average SNP density in coding regions, is it better to take the SNPs that are within genes, and divide that amount of SNPs by the total length of all genes in the genome?

Also, it's clear what I use for list_2, but what's my list_1? Is it genes affected by SNPs, all genes, or something else?

ADD REPLY • link 9.4 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

In theory, it should not have any effect as the original assumption is that snps are equally distributed throughout the genome (no matter what subsection you take it will have the same distribution). But realistically, I guess you'll get more snp-enriched genes, since the coding regions could potentially be less prone to variants and thus lower the average snp density estimate.

Also, I guess it wouldn't matter much for a fungal genome that has a lot of coding regions, where the estimate would be close to that of using the full genome.

In the case of what you're trying to answer (genes affected by these 400 SNPs, are they different then genes affected by the overall 10k SNPs?), I'd say list_1 should be only genes affected by snps and not the whole set of genes.

ADD REPLY • link 9.4 years ago by Cytosine ▴ 460

Ram · Answer 2 · 2014-11-26

1

Entering edit mode

9.4 years ago

biogirl ▴ 210

If you have a number of SNPs that you expect to see per gene, then you can calculate the z-score and use a cutoff to decide whether a gene is over- or underrepresented for SNPs.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by biogirl ▴ 210

0

Entering edit mode

It seems this approach to be very similar with what Cytosine proposed, except involving a score, which is necessary. Any chance you have any tutorials handy or any R code?

ADD REPLY • link 9.4 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

I don't I'm afraid, as I wrote it in Matlab. But I'm happy to talk you through what I did if you like!

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by biogirl ▴ 210

score 0 · Answer 3 · 2014-11-27

I believe the first thing to do is to actually filter the list SNPs, leaving only those that are in coding region and are non-synonymous ones. This would automatically remove the bias for gene length, as it mostly comes from introns. Then I would actually go for the standard testing (gene ontology).

If a more detailed view is needed, I think the best way to do is to look at protein domains affected by non-synonymous SNPs and do an GO enrichment test for them.

Of note, usually the problem is quite inverse - sparse sets of genes are affected by mutations, and the functional enrichment of interest usually reveals itself when those sets are somehow "smoothed", see this paper for example.

From the other point of view, one could be interested in analyzing the synonymous/non-synonymous substitutions and polymorphisms (see McDonald-Kreitman test), to see which genes are under genes are under positive/negative selection and perform functional enrichment tests for them.