Question

Measure Significance Of Genes Associated With Multiple Phenotypes

3

12.6 years ago

Khader Shameer 18k

I have a dataset of gene-phenotype associations. I am looking at combinations of phenotypes and the genes shared between them. I would like to use a statistical test to show whether the number of genes shared between two phenotypes is statistically significant, using a p-value or a similar measure.

For example:

22 genes are associated with Phenotype1
205 genes are associated with Phenotype2
9 genes are common to both phenotypes

I want to assess whether the number of genes common to the two phenotypes is statistically significant or just a random observation.

I have phenotype information for 4,035 genes; I assume that the human genome contains 42,071 genes.

How would you address this problem (preferably in R)? What statistical test would you recommend, and why?

PS (edit, Oct 17 2011): I also posted this question at stats.stackexchange.com.

statistics • 3.1k views

ADD COMMENT • link updated 12.5 years ago by Adrian Cortes ▴ 550 • written 12.6 years ago by Khader Shameer 18k

3

@Khader: That's the number of current entries in the gene database for Homo sapiens, which includes pseudogenes (e.g. LOC100736412), Neanderthal mitochondrial genes (trnL) and hypothetical proteins (e.g. DKFZP564C152). Just FYI, since those classes of genes would not typically be used to generate the phenotype-genotype gene lists and might inflate your number of comparisons.

ADD REPLY • link 12.6 years ago by David Quigley 11k

2

This is a great question, very pertinent. Sure, you can assume that the genome has 42,071 genes, but were all of them tested? You may need to lower that number because not all genes are represented on genotyping and gene expression platforms. That may be a reason to use whole-genome sequencing to identify variants and their associations, and RNA-Seq for gene expression.
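
To make the point about the background concrete, here is a minimal sketch (Python, standard library only) of how the overlap expected by chance changes with the assumed gene universe. The counts 22, 205 and 9 and the universe sizes 4,035, 25,000 and 42,071 all come from this thread; the function name `expected_overlap` is only an illustrative choice, and the calculation assumes both gene sets are drawn uniformly at random from the universe.

```python
def expected_overlap(n_a: int, n_b: int, universe: int) -> float:
    """Overlap expected by chance between two random gene sets of
    sizes n_a and n_b drawn from a universe of `universe` genes."""
    return n_a * n_b / universe


observed = 9            # genes shared by the two phenotypes
set_a, set_b = 22, 205  # genes associated with Phenotype1 and Phenotype2

# Fold enrichment (observed / expected) for each candidate background size.
fold_by_universe = {}
for universe in (4035, 25000, 42071):
    expected = expected_overlap(set_a, set_b, universe)
    fold_by_universe[universe] = observed / expected
    print(f"N={universe:>6}: expected overlap {expected:.3f}, "
          f"fold enrichment {observed / expected:.1f}")
```

The larger the assumed universe, the smaller the expected overlap and the more extreme the same observation looks, which is why the choice of background matters.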

ADD REPLY • link 12.6 years ago by Larry_Parnell 16k

1

Thanks, Larry. Good point, but here I used 42,071 genes because my phenotypes also include diseases. The gene-disease relationships were determined using biochemical experiments, not from array-based or sequence-based experimental platforms.

ADD REPLY • link 12.6 years ago by Khader Shameer 18k

1

Although the definition of a gene is slippery, the conventional number for "how many protein-coding genes are in the human genome?" is about 25,000. Where did you get 42,071?

ADD REPLY • link 12.6 years ago by David Quigley 11k

0

@David: The number is from NCBI (see: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606&lvl=3&lin=f&keep=1&srchmode=1&unlock). 25K may refer to the reviewed proteins in the human proteome: http://www.uniprot.org/uniprot/?query=organism:9606+keyword:181

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Khader Shameer 18k

0

Yes David, thanks for your pointers. I agree that using the entire set of genes from NCBI may affect my analysis. My dataset has associations with LOC* and hypothetical genes, but not with trnL. I will check this and refine it further.

ADD REPLY • link 12.6 years ago by Khader Shameer 18k

0

Please note that I cross-posted this question here: stats.stackexchange.com/questions/17132/statistical-significance-of-genes-associated-with-multiple-phenotypes

ADD REPLY • link 12.5 years ago by Khader Shameer 18k

score 2 · Answer 1 · 2011-10-07

2

12.6 years ago

Larry_Parnell 16k

For our situation, which is quite similar to yours, we use a Z-score. This is described by Doniger, Conklin, et al. and gives a measure of the significance of the overlap between two sets. Generally, a Z-score of 1.96 indicates positive enrichment with a p-value of roughly 0.05, while a Z-score of -1.96 indicates negative enrichment (less overlap than expected), also with p of about 0.05. As Z moves further from zero in either direction, significance increases.
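
As a rough illustration with the numbers from the question, a Z-score for the overlap can be computed from the mean and variance of the hypergeometric distribution. This is a sketch in Python (standard library only), not a reproduction of the cited method: the function name `overlap_z_score` is hypothetical, and using the 4,035 genes with phenotype information as the background is an assumption.

```python
import math


def overlap_z_score(overlap: int, n_a: int, n_b: int, universe: int) -> float:
    """Z-score of an observed overlap between two gene sets, assuming
    set A (n_a genes) is a random draw without replacement from a
    universe in which n_b genes belong to set B."""
    p = n_b / universe                 # chance that a random gene is in set B
    expected = n_a * p                 # hypergeometric mean
    # Hypergeometric variance, including the finite-population correction.
    variance = n_a * p * (1 - p) * (universe - n_a) / (universe - 1)
    return (overlap - expected) / math.sqrt(variance)


# Numbers from the question; the background of 4035 genes is an assumption.
z = overlap_z_score(overlap=9, n_a=22, n_b=205, universe=4035)
print(f"Z = {z:.2f}")
```

A positive value means more shared genes than expected by chance and a negative value means fewer; how well the usual normal cut-offs apply depends on the expected counts being reasonably large.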

ADD COMMENT • link 12.6 years ago by Larry_Parnell 16k

1

A Z-score of 1.96 comes from a normal distribution (or a standard normal variate). How do you validate the assumptions for normal distribution?
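
One way to probe this is to compare the normal-approximation p-value with an exact hypergeometric tail probability for the numbers in the question. The sketch below uses only Python's standard library; the function names are illustrative and the background of 4,035 genes is an assumption. In R, the same exact tail should be available as `phyper(8, 205, 4035 - 205, 22, lower.tail = FALSE)`.

```python
import math


def hypergeom_upper_tail(overlap: int, n_a: int, n_b: int, universe: int) -> float:
    """Exact P(X >= overlap) when n_a genes are drawn without replacement
    from `universe` genes, n_b of which belong to the other set."""
    total = math.comb(universe, n_a)
    favourable = sum(
        math.comb(n_b, i) * math.comb(universe - n_b, n_a - i)
        for i in range(overlap, min(n_a, n_b) + 1)
    )
    return favourable / total  # exact integer ratio, converted to float


def normal_upper_tail(overlap: int, n_a: int, n_b: int, universe: int) -> float:
    """Upper-tail p-value from a normal approximation (no continuity correction)."""
    p = n_b / universe
    mean = n_a * p
    sd = math.sqrt(n_a * p * (1 - p) * (universe - n_a) / (universe - 1))
    z = (overlap - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


exact_p = hypergeom_upper_tail(9, 22, 205, 4035)
approx_p = normal_upper_tail(9, 22, 205, 4035)
print(f"exact: {exact_p:.3g}, normal approximation: {approx_p:.3g}")
```

With an expected overlap of only about one gene, the two tails can differ by orders of magnitude, so an exact test looks like the safer choice when counts are this small.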

ADD REPLY • link 12.5 years ago by Arun 2.4k

0

Thanks a lot for this, will check the manuscript.

ADD REPLY • link 12.6 years ago by Khader Shameer 18k

Ram · Answer 2 · 2011-10-18

1

12.5 years ago

Adrian Cortes ▴ 550

What about the one described here:

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002254

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.5 years ago by Adrian Cortes ▴ 550

0

Thanks Adrian !!

ADD REPLY • link 12.5 years ago by Khader Shameer 18k