Question: How to compare expression in gene list
3.0 years ago, Lila M (UK) wrote:

Hi everybody, I'm trying to analyze a data set of genes. I have 3 gene lists: P1 (n = 312), P2 (n = 7444) and P3 (n = 553). I compared each gene list with an additional one (Z, n = 9488) to find the number of genes they have in common, and built the following table:

``````
#my_data
     overlap_Z  non_overlap_Z
P1          51            261
P2         121           6232
P3          89            464
``````

I also created a second table comparing the overlaps from the Z side, as follows:

``````
#my_data_Z
                  P1     P2     P3
overlap_Z        164   3030    341
non_overlap_Z   9324   6458   9174
``````

To find out whether there are differences between the samples, I ran a chi-square test in R: `chisq.test(my_data, correct = FALSE)` and `chisq.test(my_data_Z, correct = FALSE)`. In both cases the p-value is < 2.2e-16. I take these results to mean that the lists (P1, P2 and P3) are related to Z. Now I would like to know what this relationship looks like and which list is most strongly correlated with Z. Which analysis would you recommend?
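For anyone who wants to reproduce this, here is a minimal sketch of how the first table could be entered in R and passed to `chisq.test`. The counts are copied from `my_data` exactly as posted; the object names `res` and `row_prop` are only illustrative, and the row proportions are an extra step that is not part of the original post.

```r
# Counts copied from the my_data table above (rows: lists, columns: overlap with Z)
my_data <- matrix(
  c( 51,  261,
    121, 6232,
     89,  464),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("overlap_Z", "non_overlap_Z"))
)

# Same call as in the post; for tables larger than 2x2 the `correct` argument has no effect
res <- chisq.test(my_data, correct = FALSE)
print(res)

# Extra step (assumption, not in the post): fraction of each list that overlaps Z
row_prop <- prop.table(my_data, margin = 1)
print(round(row_prop, 3))
```

The row proportions are often easier to interpret than the test itself, because they show directly what share of each list is found in Z.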

Thanks!

modified 3.0 years ago by theobroma22 • written 3.0 years ago by Lila M
3.0 years ago, Annika Forsingdal (Copenhagen, Denmark) wrote:

Since you write "expression in gene list", I'm assuming that you are looking at gene expression data. If that's the case, I would correlate the fold changes, or log2(fold changes), for each of your gene lists (P1-P3) with the fold changes in Z.
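A rough sketch of what that could look like in R, assuming expression data were available. The log2 fold changes below are simulated, and the names `lfc_Z`, `lfc_P1` and `n_genes` are made up for illustration; none of this comes from the data in the question.

```r
set.seed(42)

# Hypothetical log2 fold changes for the same 50 genes in Z and in P1 (simulated)
n_genes <- 50
lfc_Z  <- rnorm(n_genes)
lfc_P1 <- lfc_Z + rnorm(n_genes, sd = 0.5)  # P1 built to track Z, plus noise

# Pearson correlation between the two sets of fold changes
res <- cor.test(lfc_P1, lfc_Z, method = "pearson")
print(res$estimate)
print(res$p.value)
```

With real data, each vector would hold one fold change per gene, matched by gene between the list and Z.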

By "expression in gene list" I mean the number of overlapping genes. If the genes in one list (P1, P2 or P3) have more overlaps with Z, I could say that this gene list is more "expressed" in Z. I don't have expression data; I only have numbers of genes (frequencies).

3.0 years ago, Santosh Anand wrote:

1) The use of the chi-square test is absolutely unwarranted here. A chi-square test is used to find out whether categorical variables are independent. The key words are 'category' and 'independence'. Category means that there is one large population, and that there are different categories which subdivide it. Check this example from http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP

"In an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference."

As you can see, the total population can be divided into different categories (M/F and D/R/I). What you are interested in knowing is whether those categories are independent or associated; a p-value below 0.05 is evidence against independence, that is, evidence for an association. Check the above URL again to see how the hypothesis testing (Ho and Ha) is set up for a chi-square test.
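To make the survey example concrete, here is a small sketch in R. The counts are invented for illustration (they are not taken from the linked page), and `survey` is just a name chosen here.

```r
# Hypothetical survey counts: one population of 1000 voters,
# cross-classified by gender and by voting preference
survey <- matrix(
  c(200, 150, 50,
    250, 300, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender     = c("male", "female"),
    preference = c("Republican", "Democrat", "Independent")
  )
)

# Ho: gender and preference are independent; Ha: they are associated
res <- chisq.test(survey)

print(res$expected)  # counts expected under independence
print(res)           # a small p-value is evidence against independence
```

Every voter falls into exactly one cell of this table, which is the "partition of one population" that the test assumes.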

Now coming to your case: your P1, P2 and P3 data are independent sets and they do not form a partition of your sample space. What I mean is that you don't have one big population from which P1, P2 and P3 are drawn as 3 different categories of data. In fact, the way you have posed the problem, I would assume that P1, P2 and P3 are independent. There is no point in further checking their independence through a chi-square test (even if it were the right test!).

2) You asked:

"Now I would like to know what this relationship looks like and which list is most strongly correlated with Z. Which analysis would you recommend?"

I am unable to understand what you need. First you wanted to check for independence and then for a correlation; these are mutually exclusive things! Also note that correlation means that you have many data points of one kind (say x) and many of another kind (say y), and you would like to know whether there is a pattern in the data, meaning whether one of them could be predicted from the other. Think of it like this: if you plot all the x-y pairs, do you see a visual pattern on the graph (for example, y increasing or decreasing as x increases)? Now coming to your data: you have just 2 numbers (or 6, let's say, but then P1, P2 and P3 are already independent), namely the numbers of genes shared with Z, and they measure different things. These 2 numbers form a single point on the graph I described, and no correlation can be drawn from a single point.
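A quick way to see the point about needing many x-y pairs is the sketch below. The vectors `x` and `y` are simulated, and the single pair (51, 164) simply reuses the two P1 overlap counts from the question as one "point".

```r
set.seed(1)

# Many simulated x-y pairs with a clear linear pattern
x <- 1:20
y <- 2 * x + rnorm(20)
r_many <- cor(x, y)
print(r_many)

# A single x-y pair: cor.test() refuses, since Pearson's test needs at least 3 pairs
single <- tryCatch(
  cor.test(51, 164),
  error = function(e) conditionMessage(e)
)
print(single)
```

Here `single` ends up holding the error message rather than a test result, because one point carries no information about a trend.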

OK, let me guess: you are trying to attach a p-value to the proportion of genes shared with Z. I'm afraid that is not possible. P-values are used for distributions (meaning a large number of objects), not for single numbers. HTH! I'm running out of time :)

Thank you very much for this explanation. Maybe I'm making this more difficult than it really is. What I want to know is whether there are any significant differences between my groups (for example, is there a significant difference between the number of overlaps in P1 and P3?) and how different they are. So, since I can't use a chi-square test, what can I use for this approach?

Thanks!

3.0 years ago, theobroma22 wrote:

This is a question that can be answered with frequency-based statistics such as conditional probability, joint probability, etc., somewhat like what Venn diagrams show. There are ways to approach this, but explaining how would take a lot of space, about as much as the previous answer on basic statistics. Basically, I think you want to test for effect modification, estimate the common odds ratio, and then test this ratio. Look up the Mantel-Haenszel method and see if it applies to you. From there you can bootstrap and run Fisher's exact test, but again this may go beyond your question.
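As a starting point, here is a minimal sketch of the odds ratio and Fisher's exact test for pairs of lists, using the counts from `my_data` as posted. It is not a Mantel-Haenszel analysis: `mantelhaen.test` in R expects a 2 x 2 x K table with a stratifying variable, which the posted counts do not have, so the pairwise comparison below is a simplifying assumption. Object names such as `counts` and `or_p1_p3` are illustrative.

```r
# Counts copied from the my_data table in the question
counts <- matrix(
  c( 51,  261,
    121, 6232,
     89,  464),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), c("overlap_Z", "non_overlap_Z"))
)

# P1 versus P3: 2x2 table, sample odds ratio (cross-product), Fisher's exact test
p1_vs_p3 <- counts[c("P1", "P3"), ]
or_p1_p3 <- (p1_vs_p3[1, 1] * p1_vs_p3[2, 2]) / (p1_vs_p3[1, 2] * p1_vs_p3[2, 1])
ft_p1_p3 <- fisher.test(p1_vs_p3)

# P1 versus P2, same approach
p1_vs_p2 <- counts[c("P1", "P2"), ]
ft_p1_p2 <- fisher.test(p1_vs_p2)

print(or_p1_p3)
print(ft_p1_p3$p.value)
print(ft_p1_p2$estimate)  # conditional MLE of the odds ratio, as reported by fisher.test
print(ft_p1_p2$p.value)
```

If several pairs are tested this way, the p-values should be corrected for multiple comparisons, for example with `p.adjust`.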