Question: Creating a score for gene-ontology representation
10 months ago by
n.anuragsharma30 wrote:

I have a group of genes (belonging to a certain developmental pathway, e.g. known to increase trichome number) and their expression data from RNA-seq. I have the log2 fold changes of each sample (3 mutants) relative to the control (wild type), as computed by the edgeR package.

I have been tasked with creating a single score for the trichome development genes as a group. The score should:

1) indicate whether the genes are differentially regulated between mutants (log2 fold change > 1.5),
2) correlate with the observed phenotype, and
3) take into account the number of differentially regulated and statistically significant (FDR < 0.05) genes that contribute to it.

I have the following number of genes that fit the first two criteria in 3 mutants: 20 genes, 130 genes and 145 genes. I have calculated a score using these as follows:

I scaled the gene expression data (values ranging from -6.0 to +4.0) to lie between 0 and 1 and then computed the geometric mean for each mutant. This gives scores of 0.0004, 0.0021 and 0.02, which correlate very well with the observed phenotype (barely any trichomes, a few trichomes and a lot of trichomes for mutants 1, 2 and 3, respectively).
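For illustration, the scaling and averaging step described above might look like the sketch below. It is written in plain Python, not edgeR/R, and the fold-change values are made up. The fixed bounds of -6.0 and +4.0 come from the range quoted above; the function names are hypothetical.

```python
import math

def min_max_scale(values, lo, hi):
    """Map each value linearly from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def geometric_mean(values):
    """Geometric mean computed via logs.

    Assumes the inputs lie in [0, 1]; a single 0 makes the result 0.
    """
    if any(v <= 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

# Hypothetical log2 fold changes for one mutant (not real data).
log_fc = [-5.9, -5.5, 2.0, 3.5]

scaled = min_max_scale(log_fc, lo=-6.0, hi=4.0)
gm = geometric_mean(scaled)
am = arithmetic_mean(scaled)

print("scaled:", [round(s, 3) for s in scaled])
print("geometric mean:", gm)
print("arithmetic mean:", am)
```

Because the geometric mean multiplies the scaled values, a few genes near the lower bound pull the score towards 0, which is one reason the scores can end up very small.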

I have three problems, however:

a) is there a better way to scale the numbers such that they don't lead to a small score (0.0004/0.0021)?
b) I'm at a loss as to how to account for the vastly different number of genes contributing to the above scores (i.e. 20, 130 and 145).
c) Is there a way to assess how good this score is in some statistical manner?

Tags: normalisation, rna-seq, scaling, R
10 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche23k wrote:

a - Why do you take the geometric mean of the rescaled logs? You can just take the arithmetic mean. Whichever score you get, you can always rescale it to suit your needs.
b - If you want your score to reflect the number of samples, you can start by just taking the sum of the individual scores, not the mean. You could also generate different scores that address each issue separately and then combine (e.g. average) them.
c - You can use resampling methods, i.e. bootstrapping to compute confidence intervals or permutation tests for significance testing.
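As a rough sketch of points b and c, the following plain-Python example computes a mean and a sum score for a gene set, a percentile bootstrap confidence interval, and a permutation p-value against random gene sets of the same size. All numbers are simulated, the function names are hypothetical, and using all measured genes as the background for the permutation test is an assumption rather than something stated in the thread.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the genes with replacement and recompute the statistic.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

def permutation_p_value(set_values, background, stat=statistics.mean,
                        n_perm=2000, seed=0):
    """Two-sided p-value: how often does a random gene set of the same
    size score at least as extremely as the observed set?"""
    rng = random.Random(seed)
    observed = abs(stat(set_values))
    k = len(set_values)
    hits = sum(
        abs(stat(rng.sample(background, k))) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Simulated log2 fold changes (not real data).
gene_set = [2.1, 1.8, 2.5, 1.6, 3.0, 2.2, 1.9, 2.7]
data_rng = random.Random(42)
background = gene_set + [data_rng.gauss(0.0, 1.0) for _ in range(500)]

mean_score = statistics.mean(gene_set)  # independent of set size
sum_score = sum(gene_set)               # grows with the number of genes
ci_low, ci_high = bootstrap_ci(gene_set)
p_value = permutation_p_value(gene_set, background)

print("mean score:", mean_score, "sum score:", sum_score)
print("95% bootstrap CI for the mean:", (ci_low, ci_high))
print("permutation p-value:", p_value)
```

The mean ignores how many genes contribute, while the sum scales with that number, so which one to use (or how to combine them) depends on how much weight the gene count should carry.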
