Newbie Methylation And Stats Question
12.1 years ago
gbayon ▴ 170

Hi everybody.

As a newcomer to bioinformatics, I often struggle with the way biological knowledge mixes with statistics. I come from the Machine Learning field and usually have problems with the naming conventions (well, among several other things, I must admit). Besides, I am not an expert in statistics, having used only the bare minimum needed to validate my work.

Well, let's try to be more precise. One of the topics I am working on most right now is the analysis of methylation array data. As you surely know, the final processed (and normalized) beta values are presented in a p × n matrix, with p probes (rows) and n samples or individuals (columns) from which the beta values were obtained. I am not currently working with the raw data.

Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group (of nB probes) that belongs to B, and the intersection is empty. Say that we want to find a way to show there is a statistically significant difference between the methylation values of both regions.

As far as I have seen in the literature, statistical tests are always done by comparing the values of the same probe between case and control groups of individuals or samples, for example when looking for differentially methylated probes.

However, if I think of directly comparing all the beta values from region A (nA * n values) against the ones in region B (nB * n values) with, say, a t-test, I get the suspicion that something is not being done the way it should. My knowledge of Biology and Statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong with this approach. Am I right?
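To make that comparison concrete, here is a minimal sketch of what pooling every value from each region and computing a two-sample statistic would look like. The toy matrix, the region index lists and the helper names (`welch_t`, `pool`) are made up for illustration, and Welch's unequal-variance form is only one possible choice of t statistic. Note that the pooled values are treated as independent observations, which is exactly the assumption in doubt, since every sample contributes values to both regions.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def pool(matrix, rows):
    """Flatten all beta values of the given probe rows into one list."""
    return [value for r in rows for value in matrix[r]]


# Hypothetical toy beta matrix: rows are probes, columns are the same 4 samples.
beta = [
    [0.80, 0.75, 0.82, 0.78],  # probe 0 (region A)
    [0.85, 0.79, 0.88, 0.81],  # probe 1 (region A)
    [0.30, 0.25, 0.33, 0.28],  # probe 2 (region B)
    [0.35, 0.27, 0.36, 0.30],  # probe 3 (region B)
    [0.32, 0.24, 0.35, 0.29],  # probe 4 (region B)
]
region_a = [0, 1]      # assumed probe indices of region A
region_b = [2, 3, 4]   # assumed probe indices of region B

pooled_a = pool(beta, region_a)  # nA * n values
pooled_b = pool(beta, region_b)  # nB * n values
t_stat, df = welch_t(pooled_a, pooled_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Turning `t_stat` into a p-value would need the t distribution with `df` degrees of freedom, which is left out here to keep the sketch free of external dependencies.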

What I have done in similar experiments is to find differentially methylated probes and then test the proportion of those probes relative to the total number of probes, so that I could assign a p-value showing a significant influence of the region of reference.
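One possible reading of that proportion test is a two-proportion z-test on the counts of differentially methylated probes inside and outside a region. The sketch below is only an illustration under that assumption: the counts are invented, the function name `two_proportion_z` is hypothetical, the p-value relies on the normal approximation, and probes are treated as independent of each other.

```python
import math
from statistics import NormalDist


def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented counts: 30 of 100 probes flagged in one group, 15 of 100 in the other.
z, p_value = two_proportion_z(30, 100, 15, 100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With small counts, an exact test (for example Fisher's exact test) would probably be a safer choice than the normal approximation.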

Several questions here: what would be a coherent approach to the regions A and B problem stated above? Is there any issue with methylation data I am not aware of that makes only the within-probe analysis valid? Are there any bibliographic references that could help me see the subtleties involved?

As you can see, the concepts are quite tangled in my mind, so any help would be much appreciated.

Regards, Gustavo

methylation • 2.9k views
12.1 years ago

I would start by saying that both these tests should only be applied when the samples can be considered independent. In your example of regions A and B, this is the case (ignoring the fact that the measurements on each region come from the same cell population and are therefore correlated in that sense). If the A and B regions overlapped, you would no longer be able to apply these tests.

Then, you ask about two possibilities: a t-test of mean comparison and another possible test which is (correct me if I misunderstood): take all DE probes (you already apply a test here) from each sample and compare the proportion of DE probes between samples. This is a proportion test.

The first approach seems much more sound from a statistical point of view than the second one. Indeed, in the second test you drastically reduce the information you are using (values will be reduced to binary values: DE/not-DE), and this will reduce the power of your test. I don't know why you are uncomfortable with the t-test, maybe it doesn't sound fancy enough? ;-)


Hi Leonor.

First of all, thank you for your kind reply. I think the main problem I am having is that it is difficult for me to put into words what I really intend to say. That's why I like places like Biostars or StackOverflow: they let me work through these thought problems in written dialogue. This is to say I am not uncomfortable with the t-test. Actually, I think that, since I jumped from ML, we (the test and I) have developed a good and respectful relationship. ;)

(Let's head on to the problem, Gus). Well, if I am trying to see whether two samples of beta values coming from the same probe are significantly different, I do not have any thought problem, since I think of the beta values from a single probe as a marginal distribution of the general, multivariate and unknown distribution from which we are sampling our data. In that case, I am making inferences between subsamples of the same sample, both of them obtained according to a given criterion (for example, the typical classification problem between control and cancer samples). Informally, I think of this as "comparing by rows".

I do have problems, however, when, as I stated above, I have regions defined over probes. I think that is because of my view of probes as marginal distributions. Imagine that I have different measurements of a body: arm length, leg length, etc. For me, these are the equivalents of probes, so I have no problem comparing arm lengths with each other, but I do if I think about comparing arm lengths with leg lengths. More informally, "comparing by columns" seems strange to me.

If I understood you correctly, you are telling me that, given that the regions share no probes, we could consider them independent, even if they comprise values coming from the same individuals. Can we do that? That is the most difficult point for me to understand because, as the columns of beta values in regions A and B correspond to paired individuals (each individual has a column of data in both region A and region B), I really have difficulty considering them independent.
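If the pairing is the worry, one way to respect it (a sketch, not necessarily what anyone in this thread had in mind) is to summarise each region per individual and then test the per-individual differences. The example below averages the probes of each region within every sample and runs an exact sign-flip permutation test on the differences. The toy matrix, the region indices and the helper names are all hypothetical.

```python
import itertools
import statistics


def region_means(matrix, rows):
    """Mean beta value over the given probe rows, one value per sample."""
    n_samples = len(matrix[0])
    return [statistics.mean([matrix[r][j] for r in rows]) for j in range(n_samples)]


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation p-value for mean(diffs) == 0."""
    observed = abs(statistics.mean(diffs))
    as_extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = statistics.mean([s * d for s, d in zip(signs, diffs)])
        if abs(flipped) >= observed - 1e-12:  # tolerance for float rounding
            as_extreme += 1
        total += 1
    return as_extreme / total


# Hypothetical toy beta matrix: rows are probes, columns are the same 6 individuals.
beta = [
    [0.80, 0.75, 0.82, 0.78, 0.77, 0.81],  # probe 0 (region A)
    [0.85, 0.79, 0.88, 0.81, 0.80, 0.84],  # probe 1 (region A)
    [0.30, 0.25, 0.33, 0.28, 0.27, 0.31],  # probe 2 (region B)
    [0.35, 0.27, 0.36, 0.30, 0.29, 0.33],  # probe 3 (region B)
    [0.32, 0.24, 0.35, 0.29, 0.26, 0.30],  # probe 4 (region B)
]
region_a = [0, 1]
region_b = [2, 3, 4]

mean_a = region_means(beta, region_a)
mean_b = region_means(beta, region_b)
diffs = [a - b for a, b in zip(mean_a, mean_b)]  # one paired difference per individual
p_value = sign_flip_p_value(diffs)
print(f"paired differences: {[round(d, 3) for d in diffs]}, p = {p_value:.4f}")
```

Enumerating all sign patterns is only practical for a small number of individuals (2^n patterns); with more samples, random sign flips or a paired t-test on `diffs` would be the usual alternatives.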

Your point about the power of the proportion test (the last paragraph) was very inspiring. I did not think about it that way, and now I think you are completely right. :)


This is quite a long comment... But let's try to stay focused on the question.

You say that comparing probes from different regions would seem like comparing arm lengths and leg lengths. I do not agree, and this is actually one of the bases of microarray analyses (comparing different probes to get the most DE ones and then working on these regions). From the conceptual point of view, arm lengths and leg lengths are not homogeneous (legs are always longer than arms, independently of most biological factors). However, probes are homogeneous: there is no reason (outside the biological effect you are looking for) to expect probeX to have a larger value than probeY.

The other point you raise concerns the independence between probes given that they all come from the same individuals. Again, I think you are over-complicating the problem and not raising the correct questions. Dependence would be if region A was a repetition of region B, or if they had some overlap. This is not the case here.


Thank you again, Leonor. I am sorry for the length of my previous comment, but I guess that is just inversely proportional to my knowledge about the problem. ;)

I think homogeneity is the key for me to understand it. Correct me if I'm wrong. It is not something general, it depends on the real problem and, for this one, our variables are homogeneous, so we can compare them. Am I right?

With respect to the other point, I think I understand your point of view, but I am less sure than about the previous one. What if some probes in region B are correlated with some in region A? Could we then be talking about dependent variables?
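A quick way to look at that question on real data would be to compute the correlation, across individuals, between a probe from region A and a probe from region B. The sketch below does this for two invented probe profiles with a hand-written Pearson coefficient. The values and the function name `pearson` are assumptions for illustration only, and a high correlation by itself says nothing about whether its source is technical or biological.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented beta values for the same 6 individuals.
probe_a = [0.80, 0.75, 0.82, 0.78, 0.77, 0.81]  # a probe from region A
probe_b = [0.30, 0.25, 0.33, 0.28, 0.27, 0.31]  # a probe from region B

r = pearson(probe_a, probe_b)
print(f"r = {r:.3f}")
```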

By the way, I am going to print your previous comment and put it on the wall just in front of me. It was very clear and precise. Thank you. I really mean it. :)

• homogeneity: this always depends on your problem, but the main question you should ask yourself is: are X and Y comparable or are there intrinsic differences other than the biological effect I'm trying to detect?

• independence: again, here, are A and B correlated by some factor that has no biological meaning (repetition, overlap, ...) or are they correlated by the biological effect you are working on (same transcription factor affecting their expression, or something else for methylation)? If the first is correct, you have a problem; if the second is correct, you might have a nice result.

• poster on wall: well... this is not quite what I was looking for, but if it helps.