set `Treatment` as factor

Question

One-way ANOVA in R for many observations

3

Entering edit mode

6.2 years ago

darzenis ▴ 60

Hello,

I am trying to do one-way ANOVA in R to check for significant variations in biochemical concentrations between treatment groups.

This is how my data table is set up:

Treatment     Biochem_1     Biochem_2
A                   2.33                0.42
A                   3.21                0.56
B                   1.21                1.34
B                   2.11                0.99

I am able to run a simple code for one of the biochemicals (collumns):

biochem_1 <- aov(biochem_1 ~ Treatment, data = data_name)

And then get a summary of stat results:

summary(biochem_1)

The problem is that I have over a hundred biochemicals (collumns) in my data set and I do not want to run the same code and over again by manually changing the collumn name.

How could I write my code to run ANOVA for all the biochemicals?

Thank you!

R • 13k views

ADD COMMENT • link updated 12 weeks ago by crystalineocean • 0 • written 6.2 years ago by darzenis ▴ 60

score 12 · Answer 1 · 2018-01-25

12

Entering edit mode

6.2 years ago

Kevin Blighe 87k

Edit: 3rd December 2018

Edit: 27th December 2020

The original solution using a for loop is below. Here is a simpler / quicker solution using lapply:

# create test data
df <- data.frame(
  Treatment=c("A","A","B","B"),
  Biochem_1=c(2.33, 3.21, 1.21, 2.11),
  Biochem_2=c(0.42, 0.56, 1.34, 0.99),
  Biochem_3=c(3,2,1,1))

df
  Treatment Biochem_1 Biochem_2 Biochem_3
1         A      2.33      0.42         3
2         A      3.21      0.56         2
3         B      1.21      1.34         1
4         B      2.11      0.99         1

set `Treatment` as factor

df$Treatment <- factor(df$Treatment, levels=c("A","B"))

store all formulae in a list

formulae <- lapply(colnames(df)[2:ncol(df)], function(x) as.formula(paste0(x, " ~ Treatment")))

go through list and run `aov()`

res <- lapply(formulae, function(x) summary(aov(x, data = df)))
names(res) <- format(formulae)
res
$`Biochem_1 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1 1.2321  1.2321   3.111   0.22
Residuals    2 0.7922  0.3961               

$`Biochem_2 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1 0.4556  0.4556   12.82 0.0699 .
Residuals    2 0.0711  0.0355                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$`Biochem_3 ~ Treatment`
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1   2.25    2.25       9 0.0955 .
Residuals    2   0.50    0.25                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

extract just p-values

p <- unlist(lapply(res, function(x) x[[1]]$"Pr(>F)"[1]))
p
Biochem_1 ~ Treatment Biochem_2 ~ Treatment Biochem_3 ~ Treatment 
           0.21983666            0.06989405            0.09546597

data.frame(
  Biochemical = sub(' ~ Treatment', '', names(p)),
  pvalue = p)
                      Biochemical     pvalue
Biochem_1 ~ Treatment   Biochem_1 0.21983666
Biochem_2 ~ Treatment   Biochem_2 0.06989405
Biochem_3 ~ Treatment   Biochem_3 0.09546597

----------------------------------------------------------------------------------

Original answer:

Here you go, Sire:

data_name
  Treatment Biochem_1 Biochem_2
1         A      2.33      0.42
2         A      3.21      0.56
3         B      1.21      1.34
4         B      2.11      0.99

#Ensure that category 'A' is the reference level (does not make much difference here) 
data_name$Treatment <- factor(data_name$Treatment, levels=c("A","B"))

for (i in 2:ncol(data_name)) {
   formula <- as.formula(paste(colnames(data_name)[i], " ~ Treatment", sep=""))
   model <- aov(formula, data = data_name)

   cat("\n-----\n\n")
   cat(colnames(data_name)[i])
   cat("\n")
   print(summary(model))
}

-----

Biochem_1
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1 1.2321  1.2321   3.111   0.22
Residuals    2 0.7922  0.3961               

-----

Biochem_2
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1 0.4556  0.4556   12.82 0.0699 .
Residuals    2 0.0710  0.0355                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can also wrap everything in sink() commands so that all output goes directly to a file:

sink("Output.txt")
  My Code
sink()

ADD COMMENT • link 3.3 years ago by Kevin Blighe 87k

0

Entering edit mode

Also take a look at my code here to see how you could run many hundreds of thousands of models via parallel processing: R functions edited for parallel processing (see the section entitled Any type of LM, GLM, clogit (here: clogit))

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you! This was very helpful!

ADD REPLY • link 6.2 years ago by darzenis ▴ 60

0

Entering edit mode

@Kevin Blighe

Is it possible that I can export a table with two columns: one for the observation and one for its calculated p-value?

Thanks,

ADD REPLY • link 3.3 years ago by luongthang1908 ▴ 50

0

Entering edit mode

Yes, sorry, I have modified the procedure to do this. Start from 'go through list and run aov()'

ADD REPLY • link 3.3 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you so much, The script works perfectly.

Additionally, is it possible to have the pair-wise posthoc test p-value after ANOVA? It will be really helpful to have this statistical test data in the same result table. My apology for this additional question.

Thanks,

ADD REPLY • link 3.3 years ago by luongthang1908 ▴ 50

0

Entering edit mode

like this?

condition data             model      term      contrast null.value estimate conf.low conf.high adj.p.value
  <chr>     <list>           <list>     <chr>     <chr>         <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
1 Biochem_1 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0   -1.11    -3.82      1.60       0.220 
2 Biochem_2 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0    0.675   -0.135     1.49       0.0698
3 Biochem_3 <tibble [4 × 2]> <TukeyHSD> Treatment B-A               0   -1.5     -3.65      0.649      0.0954

ADD REPLY • link 3.3 years ago by cpad0112 21k

0

Entering edit mode

Hi cpad0112,

Thank you so much for your reply. I would like to have all data in one table like this one (suppose to have 4 variables instead of 2 as the above example):

          pvalue_anova value_constrast_A_vs_B pvalue_constrast_A_vs_C pvalue_constrast_A_vs_D
biochem_1            1                      2                       3                       4
biochem_2            8                      9                      10                      11
biochem_3           15                     16                      17                      18
          pvalue_constrast_B_vs_C pvalue_constrast_B_vs_D pvalue_constrast_C_vs_D
biochem_1                       5                       6                       7
biochem_2                      12                      13                      14
biochem_3                      19                      20                      21

Thanks,

ADD REPLY • link 3.3 years ago by luongthang1908 ▴ 50

0

Entering edit mode

I realize it's been many years but hopefully someone can still answer my question:

Is there any way I can extract the residuals instead of the p values when using the lapply method? I've tried every way I can think of but it just keeps replying "NULL"

Thank you!

ADD REPLY • link 12 weeks ago by crystalineocean • 0

score 1 · Answer 2 · 2018-01-25

1

Entering edit mode

6.2 years ago

brandontsinn ▴ 10

Here is a link to a software carpentry workshop on loops (and their faster alternatives) in R, that I think you might find to be useful.

ADD COMMENT • link 6.2 years ago by brandontsinn ▴ 10

set Treatment as factor

store all formulae in a list

go through list and run aov()

extract just p-values

----------------------------------------------------------------------------------

----------------------------------------------------------------------------------

set `Treatment` as factor

go through list and run `aov()`