Question: p value from multiple columns using R??
0
Ankit140 wrote:

Hi all,

I have data.frame of 6 columns. I want to calculate p.value for each row for all 6 columns. How to do this?

It might be possible values at the same row is positive or negative depending upon the samples type. So what is the script for this in R???

my data.frame looks like this:

``````                                  S1      S2         S3         S4         S5      S6
chr10%100011600%100011900 **0.96627103 0.96794047 0.96125479 0.96429540 0.97379428 0.96927928** example p.value for this?
chr10%100155000%100155300 0.91513668 0.93109224 0.93609273 0.89549083 0.92005647 0.91510770
chr10%100174800%100175100 0.02461070 0.02286753 0.02089150 0.03185857 0.03231618 0.02724574
chr10%100206600%100206900 0.07662103 0.08028024 0.08027826 0.07767043 0.08447515 0.07026636
chr10%100993500%100993800 0.06651217 0.05231399 0.04417633 0.13046962 0.09304180 0.06435386
chr10%100995600%100995900 0.13325158 0.10427350 0.10157604 0.15673318 0.14602494 0.13762823
``````
p.value statistical • 155 views
modified 4 weeks ago by Sam3.0k • written 4 weeks ago by Ankit140

What is the statistical test you want to execute?

I would like to apply unpaired two tailed t.test

If you have other suggestions please let me know for 6 or more samples?

Testing what against what?

Compare S1,S2,S3 (control) with respect to S4,S5,S6 (cases).

Basically before getting this matrix.

1. I took average of all controls (S1 - S3).

2. Subtracted all control individually from average.eg Cntrlavg -S1, Cntrlavg -S2, Cntrlavg -S3 = Control Set

3. Subtracted all cases individually from average.eg Cntrlavg -S4, Cntrlavg -S5, Cntrlavg -S6 = Case set

What you are seeing in the above matrix are the subtracted value and they can be negative too.

Now I want to compare all controls versus all cases .

How to do this?

Am I missing something after subtraction and before statistical analysis?

I think unpaired t-test or mannwhitney U test may help you in comparison.

OK can you share the R script for that? It will be very helpful.

I tried this:

``````myp.values <- apply(mymatrix[,1:6], 1, function(x) t.test(x[,1:3],x[,4:6], paired=FALSE))
``````

But it does not seems to be working.

``````mymatrix\$p.value= apply(mymatrix, 1, function(x) t.test(x[1:3],x[4:6])\$p.value)
``````

It seems it gives same p.values for all the regions? Why so? Something wrong I'm doing :

My script:

``````t.result <- apply(mymatrix, 1, function(x) t.test(mymatrix[,24:35],mymatrix[,36:39], paired=FALSE)\$p.value)
``````

I applied this on large matrix 24:35 are all controls , 36:39 are all cases.

Is the problem happening due to subtraction of all samples from average of controls?

``````apply(mymatrix, 1, function(x) t.test(x[24:35],x[36:39], paired=FALSE)\$p.value
``````

My first advise would be to use packages that do this sort of tests, for multiple reasons:

1. Your normalization might be correct but it has some underlying assumptions, you should be aware to it.
2. You have multiple hypotheses that you should correct for
3. You don't have enough power in each row for a statistical test, there are packages that could better estimate the variance and have a more powerful test.

If your input is counts I would suggest edgeR or DESeq2. Otherwise I guess limma could be used but I'm not sure if it has assumptions about the input (unless yours is microarray)

Thanks for suggestion.

My data is a methylation array. But after the subtraction from control average, the values becomes positive or negative. Now these are neither the counts, or normalized expression values or beta values or M values. They are just integers values obtained after subtractions.

If you think sample size a limitation, I will any way increase the control to 40 samples and cases to 20 samples (using public and our old data after batch adjustment).

Since Deseq2 and edgeR require raw counts, my dataset is methylation array with beta values ranged from 0 - 1, So DEseq2 or edgeR might not be the right tool here.

You mentioned about limma package. Do you think if I can input mymatrix to limma for calculating significance region wise (row-wise)?

Some methylation array packages identufy differential methylation too but here I need to select the cases based on the particular phenotypes. So I will add or remove the case to find the most observed differentially methylated loci.

The one possibility is use the original batch corrected matrix (without any subtraction) and give the same group ID to selected cases (alternative combinations) and then search for differential methylation. But that is very tedious. So I already subtracted everything from control average and want to apply statistcal test between all control v/s all/selected cases to obtain significantly different regions.

I will appreciate any help or suggestions.

Thanks

Have you looked at this paper? I think limma is appropriate here, especially if you need to do batch effect correction which can be a part of the linear model.

Does that mean you want row 1 compare with all other rows, the repeat? Or are you comparing S1 vs S2, S1 vs S3, ..., S5 vs S6?

Hi Sam,

Please see my above comment for comparison purposes. Basically I want to compare S1-S3 v/s S4-S6 for each row and add p.value in parallel column. Then apply benjamini hochberg for multiple correction

1
Sam3.0k wrote:

Using data.table (assuming your data is called dt), you can do a t.test on each row with the following code. Though as others have suggested, you should really think about your experimental design and use the appropriate statistic model, otherwise you will just get meaningless p-values.

``````library(data.table)
dt\$name <- row.names(dt)
dt <- as.data.table(dt)
dt[,{
model <- t.test(x=c(S1,S2,S3),y=c(S4,S5,S6))
list(p=model\$p.value,
z=model\$statistic)
}, by="name"]
``````