Question

Creating a score for gene-ontology representation

4.5 years ago

n.anuragsharma ▴ 30

I have a group of genes (belonging to a certain developmental pathway, e.g. known to increase trichome number) and their expression data from RNA-seq. I have the log2 fold changes of each sample (3 mutants) relative to the control (wild type), as computed by the edgeR package.

I have been tasked with creating a single score for the trichome development genes as a group. The score should:

1) indicate whether the genes are differentially regulated between mutants (log2 fold change > 1.5),
2) correlate with the observed phenotype, and
3) take into account the number of differentially regulated and statistically significant (FDR < 0.05) genes that contribute to it.

The numbers of genes that fit the first two criteria in the 3 mutants are 20, 130 and 145. I have calculated a score using these as follows:

I scaled the gene expression data (values ranging from -6.0 to +4.0) to lie between 0 and 1 and then computed the geometric mean for each mutant. This gives scores of 0.0004, 0.0021 and 0.02, which correlate very well with the observed phenotype (barely any trichomes, a few trichomes and a lot of trichomes for mutants 1, 2 and 3, respectively).
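The scaling and averaging described above can be sketched roughly as follows. This is only an illustration under assumptions: the fold-change values are made up, the helper names `min_max_scale` and `geometric_mean` are hypothetical, and the fixed range of -6.0 to +4.0 is taken from the description rather than from the real data. It also computes the arithmetic mean of the same scaled values for comparison.

```python
import math

def min_max_scale(values, lo, hi):
    """Map each value from the range [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def geometric_mean(values):
    """Geometric mean computed via logs; inputs must be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical log2 fold changes for one mutant (NOT real data).
log2_fc = [-5.5, -4.0, -2.0, 1.6, 3.0]

# Scale using the overall range mentioned in the post (-6.0 to +4.0).
scaled = min_max_scale(log2_fc, lo=-6.0, hi=4.0)

geo = geometric_mean(scaled)        # pulled down by scaled values close to 0
arith = sum(scaled) / len(scaled)   # arithmetic mean of the same values

print("geometric mean: ", geo)
print("arithmetic mean:", arith)
```

The geometric mean can never exceed the arithmetic mean, and it is undefined (or zero) as soon as one scaled value is exactly 0, so strongly down-regulated genes that land near 0 after scaling dominate it.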

I have three problems, however:

a) Is there a better way to scale the numbers so that they don't lead to such small scores (0.0004/0.0021)?
b) I'm at a loss as to how to account for the vastly different number of genes contributing to the above scores (i.e. 20, 130 and 145).
c) Is there a way to assess how good this score is in some statistical manner?

RNA-Seq R Scaling Normalisation • 710 views

updated 4.5 years ago by Jean-Karim Heriche 27k • written 4.5 years ago by n.anuragsharma ▴ 30

score 1 · Answer 1 · 2019-11-14

a - Why do you take the geometric mean of the rescaled logs? You can just take the arithmetic mean. Whichever score you get, you can always rescale it to suit your needs.
b - If you want your score to reflect the number of genes contributing to it, you can start by just taking the sum of the individual scores, not the mean. You could also generate different scores that address each issue separately and then combine (e.g. average) them.
c - You can use resampling methods, i.e. bootstrapping to compute confidence intervals, or permutation tests to assess significance.
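
As a rough sketch of points b and c, the snippet below contrasts a mean-based and a sum-based score and then applies a percentile bootstrap and a gene-sampling permutation test. Everything in it is an assumption made for illustration: the fold-change values are invented, the function names `bootstrap_ci` and `permutation_p_value` are hypothetical, and the choice of comparing the pathway against random gene sets of the same size drawn from a background list is just one possible null model, not necessarily the one intended above.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the genes with replacement and recompute the score each time.
    boots = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def permutation_p_value(pathway, background, stat=statistics.fmean,
                        n_perm=2000, seed=0):
    """One-sided p-value: how often a random gene set of the same size
    scores at least as high as the pathway gene set."""
    rng = random.Random(seed)
    observed = stat(pathway)
    pool = pathway + background
    k = len(pathway)
    hits = sum(stat(rng.sample(pool, k)) >= observed for _ in range(n_perm))
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical log2 fold changes (NOT real data).
pathway = [2.1, 1.8, 2.5, 3.0, 1.7, 2.2]           # pathway genes, one mutant
data_rng = random.Random(42)
background = [data_rng.gauss(0.0, 1.0) for _ in range(200)]  # all other genes

mean_score = statistics.fmean(pathway)  # does not depend on the gene count
sum_score = sum(pathway)                # grows with the number of genes

ci_low, ci_high = bootstrap_ci(pathway)
p_value = permutation_p_value(pathway, background)

print("mean score:", mean_score, "sum score:", sum_score)
print("bootstrap CI for the mean:", (ci_low, ci_high))
print("permutation p-value:", p_value)
```

Passing `stat=sum` instead of the default mean gives the same resampling checks for the sum-based score. With only a handful of genes, as for the mutant with 20 genes, the bootstrap interval will be wide, which is itself useful information about how much to trust the score.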