Question: Measure Significance Of Genes Associated With Multiple Phenotypes
3
7.9 years ago by
Manhattan, NY
Khader Shameer18k wrote:

I have a dataset of gene-phenotype associations in this format. I am looking at combinations of phenotypes and the genes shared between them. I would like to use a statistical test, reporting a p-value or a similar measure, to show whether the number of genes shared between two phenotypes is statistically significant.

For example:

22 genes are associated with Phenotype1 
205 genes are associated with Phenotype2 
9 genes are common between two phenotypes

I want to assess whether the number of genes common to the two phenotypes is statistically significant or could be explained by chance.

I have phenotype information for 4035 genes; I assume that the human genome contains 42,071 genes.

How would you address this problem (preferably in R)? What statistical test would you recommend, and why?

PS (edit, Oct 17 2011): I also posted this question at stats.stackexchange.com.

statistics • 1.8k views
ADD COMMENTlink modified 7.9 years ago by Adrian Cortes490 • written 7.9 years ago by Khader Shameer18k
3

@Khader: That's the number of current entries in the Gene database for Homo sapiens, which includes pseudogenes (e.g. LOC100736412), Neanderthal mitochondrial genes (trnL) and hypothetical proteins (e.g. DKFZP564C152). Just FYI, since those classes of genes would not typically be used to generate the phenotype-genotype gene lists and might inflate your number of comparisons.

ADD REPLYlink written 7.9 years ago by David Quigley11k
2

This is a great question, and very pertinent. Sure, you can assume that the genome has 42071 genes, but were all of them tested? You may need to lower that number, because not all genes are represented on genotyping and gene expression platforms. That limitation is one reason to use whole-genome sequencing to identify variants and their associations, and RNA-Seq for gene expression.

ADD REPLYlink written 7.9 years ago by Larry_Parnell16k
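
To make the point about the background concrete, here is a minimal sketch (not from the thread) of how the expected overlap changes with the assumed gene universe. It uses the counts from the question (22, 205, and the two candidate universes 4035 and 42071); the function name `expected_overlap` is made up for illustration, and the calculation assumes both gene lists are random draws from the same universe.

```python
def expected_overlap(set_a: int, set_b: int, universe: int) -> float:
    """Expected number of shared genes if both lists were random
    draws from the same universe of `universe` genes."""
    return set_a * set_b / universe


observed = 9  # shared genes reported in the question

# Compare the two candidate backgrounds discussed in the thread.
for universe in (4035, 42071):
    expected = expected_overlap(22, 205, universe)
    print(
        f"universe={universe}: expected overlap={expected:.2f}, "
        f"observed/expected={observed / expected:.1f}"
    )
```

The larger the assumed universe, the smaller the expected overlap and the more extreme the observed 9 genes will look, so the choice of background directly drives the apparent significance.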
1

Thanks Larry. Good point, but here I used 42071 genes because my phenotypes also include diseases. The gene-disease relationships were determined using biochemical experiments, not array-based or sequence-based experimental platforms.

ADD REPLYlink written 7.9 years ago by Khader Shameer18k
1

Although the definition of a gene is slippery, the conventional number for "how many protein-coding genes are in the human genome?" is about 25,000. Where did you get 42,071?

ADD REPLYlink written 7.9 years ago by David Quigley11k

@David: The number is from NCBI (See: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606&lvl=3&lin=f&keep=1&srchmode=1&unlock). 25K may indicate reviewed proteins in human proteome: http://www.uniprot.org/uniprot/?query=organism:9606+keyword:181

ADD REPLYlink written 7.9 years ago by Khader Shameer18k

Yes David, thanks for your pointers. I agree that using the entire set of genes from NCBI may affect my analysis. In my dataset, I have associations with LOC* and hypothetical genes, but not trnL. I will check this and refine it further.

ADD REPLYlink written 7.9 years ago by Khader Shameer18k

Please note that I cross-posted this question here: stats.stackexchange.com/questions/17132/statistical-significance-of-genes-associated-with-multiple-phenotypes

ADD REPLYlink written 7.9 years ago by Khader Shameer18k
2
7.9 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

For our situation, which is quite similar to yours, we use a Z-score. This is described by Doniger, Conklin, et al. and gives a measure of the significance of the overlap between two sets. Generally, a Z-score of 1.96 indicates positive enrichment at a two-sided p-value of roughly 0.05, while a Z-score of -1.96 indicates negative enrichment (less overlap than expected), also at p of about 0.05. As the absolute value of Z increases in either direction, significance increases.

ADD COMMENTlink written 7.9 years ago by Larry_Parnell16k
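
For readers who want to try a Z-score on the numbers in the question, here is a rough sketch. It standardizes the observed overlap using the mean and variance of a hypergeometric distribution. Whether this matches the exact formula in Doniger, Conklin, et al. should be checked against the paper; the function name `overlap_z_score` and the choice of the 4035 annotated genes as the universe are assumptions made for illustration.

```python
import math


def overlap_z_score(overlap: int, set_a: int, set_b: int, universe: int) -> float:
    """Z-score of an observed overlap, using the hypergeometric
    mean and variance (set_a genes drawn without replacement from
    a universe that contains set_b 'marked' genes)."""
    p = set_b / universe                      # fraction of marked genes
    expected = set_a * p                      # hypergeometric mean
    fpc = 1 - (set_a - 1) / (universe - 1)    # finite population correction
    sd = math.sqrt(set_a * p * (1 - p) * fpc)
    return (overlap - expected) / sd


# Counts from the question; the universe (4035) is an assumption.
z = overlap_z_score(overlap=9, set_a=22, set_b=205, universe=4035)
print(f"Z = {z:.2f}")
```

A positive Z means more shared genes than expected by chance, a negative Z means fewer. Converting Z to a p-value relies on a normal approximation, which may be poor when the expected overlap is small.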
1

A Z-score of 1.96 comes from a normal distribution (that is, a standard normal variate). How do you validate the assumption of a normal distribution?

ADD REPLYlink written 7.9 years ago by Arun2.3k
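
One way to sidestep the normality assumption is to compute the tail probability exactly from the hypergeometric distribution. This is not proposed in the thread, so treat it as a sketch under stated assumptions: both gene lists are drawn from the same universe (here the 4035 annotated genes), and the function names are made up for illustration. The normal-approximation tail is computed alongside for comparison.

```python
from math import comb, erfc, sqrt


def hypergeom_upper_tail(overlap: int, set_a: int, set_b: int, universe: int) -> float:
    """Exact P(X >= overlap) when set_a genes are drawn without
    replacement from a universe containing set_b marked genes."""
    total = comb(universe, set_a)
    upper = min(set_a, set_b)
    tail = sum(
        comb(set_b, k) * comb(universe - set_b, set_a - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def normal_upper_tail(z: float) -> float:
    """One-sided upper tail of the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))


# Counts from the question; the universe (4035) is an assumption.
overlap, set_a, set_b, universe = 9, 22, 205, 4035

# Z-score from the hypergeometric mean and variance.
p = set_b / universe
sd = sqrt(set_a * p * (1 - p) * (universe - set_a) / (universe - 1))
z = (overlap - set_a * p) / sd

exact_p = hypergeom_upper_tail(overlap, set_a, set_b, universe)
approx_p = normal_upper_tail(z)
print(f"exact hypergeometric p = {exact_p:.3g}")
print(f"normal approximation p = {approx_p:.3g}")
```

When the expected overlap is close to 1, as it is here, the normal approximation can give a much smaller tail probability than the exact calculation, so the exact value is the safer one to report. In R, the same exact tail would typically be obtained with `phyper(8, 22, 4013, 205, lower.tail = FALSE)`.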

Thanks a lot for this, will check the manuscript.

ADD REPLYlink written 7.9 years ago by Khader Shameer18k
1
7.9 years ago by
Adrian Cortes490
Brisbane, Australia
Adrian Cortes490 wrote:

What about the one described here:

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002254

ADD COMMENTlink written 7.9 years ago by Adrian Cortes490

Thanks Adrian !!

ADD REPLYlink written 7.9 years ago by Khader Shameer18k