Question: Running ANOVA in R on three tables of gene expression data
4
gravatar for Volka
2.2 years ago by
Volka120
Volka120 wrote:

Hi all, I am currently trying to reproduce some results of previous work done at a lab. What I have right now are gene expression data in tables for three ethnic groups, with columns formatted as such:

ID Batch Gender Gene1 Gene2 Gene3 ...

..and so on for about 20 thousand genes. I have three of these tables for three ethnic groups, and I would like to know how to perform ANOVA in R on these three groups, in order to get the most differentially expressed genes. I have already tried limma for this, and I am looking to use ANOVA now to compare the results.

Thanks in advance!

ADD COMMENTlink modified 7 months ago by janainamace10 • written 2.2 years ago by Volka120

Hello. This was the best tutorial on how to apply the command in the analysis of gene expression. It worked perfectly with my data. Please, if I want to store all the values ​​shown by the print command, is it possible?

ADD REPLYlink written 7 months ago by janainamace10

Take a look at write.table()

ADD REPLYlink written 7 months ago by Kevin Blighe66k
8
gravatar for Kevin Blighe
2.2 years ago by
Kevin Blighe66k
Kevin Blighe66k wrote:

You should first aim to merge your datasets. To do this, they'll require common colnames. Here's a reproducible example using random-generated data:

data1 <- matrix(rexp(200, rate=.1), ncol=20)
data2 <- matrix(rexp(400, rate=.1), ncol=20)
data3 <- matrix(rexp(100, rate=.1), ncol=20)

colnames(data1) <- paste("gene", 1:ncol(data1), sep="")
colnames(data2) <- paste("gene", 1:ncol(data2), sep="")
colnames(data3) <- paste("gene", 1:ncol(data3), sep="")

data1 <- data.frame(rep("Black", nrow(data1)), data1)
colnames(data1)[1] <- "ethnicity"
data2 <- data.frame(rep("Asian", nrow(data2)), data2)
colnames(data2)[1] <- "ethnicity"
data3 <- data.frame(rep("Hispanic", nrow(data3)), data3)
colnames(data3)[1] <- "ethnicity"

merge <- rbind(data1, data2, data3)
merge[,1:5]

   ethnicity       gene1       gene2       gene3       gene4
1      Black  1.16514388 12.30338069  3.82688868 11.36645599
2      Black 24.38089444  9.03542219  6.80817973  0.10700084
3      Black 17.43543444 12.92760579 35.47927735  5.04846676
...
9      Black 12.98195603 21.59992268  0.71956284  3.55260945
10     Black 16.53245930  3.64460684 18.28704018 13.95026225
11     Asian  6.50373870  1.76874482 41.11393353 22.17943616
12     Asian  6.58345454 13.25700767  2.04473476  8.48304042
13     Asian 22.83291833  0.06681459  2.58220889  7.63248332
...
27     Asian  7.97365692  7.67312685  4.41263392 10.72628836
28     Asian  1.87894546  1.47503110  9.30714773 16.05473038
29     Asian  9.10527813 11.70059070 26.77912256  2.67517565
30     Asian  2.11775040  0.62188180 15.13412334  7.83838828
31  Hispanic 27.56932722  1.56425509  9.30974219  6.38043914
32  Hispanic  3.40418082 19.97802677  8.17777047  2.53841623
33  Hispanic 31.19901835  5.39049237  0.43565871  6.44909321
34  Hispanic  5.12689485 11.20531531 17.32246742  5.86373778
35  Hispanic 12.60268618  9.75769548 12.34081183 27.50402493

Then, set the ethnicity variable as categorical:

merge$ethnicity <- factor(merge$ethnicity, levels=c("Asian","Black","Hispanic"))
merge$ethnicity
 [1] Black    Black    Black    Black    Black    Black    Black    Black   
 [9] Black    Black    Asian    Asian    Asian    Asian    Asian    Asian   
[17] Asian    Asian    Asian    Asian    Asian    Asian    Asian    Asian   
[25] Asian    Asian    Asian    Asian    Asian    Asian    Hispanic Hispanic
[33] Hispanic Hispanic Hispanic
Levels: Asian Black Hispanic

Then. we can do either parametric or non-parametric ANOVA:

parametric ANOVA (slow; simple for loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- summary(aov(as.formula(formula), data=merge))[[1]][["Pr(>F)"]][1]

    print(paste(formula, ": p=", p, sep=""))
}
[1] "gene1 ~ ethnicity: p=0.622349829393588"
[1] "gene2 ~ ethnicity: p=0.63661393769293"
[1] "gene3 ~ ethnicity: p=0.948436180231233"
[1] "gene4 ~ ethnicity: p=0.875863416358478"
[1] "gene5 ~ ethnicity: p=0.306303439037517"
[1] "gene6 ~ ethnicity: p=0.622549894820414"
...
[1] "gene15 ~ ethnicity: p=0.848376444462962"
[1] "gene16 ~ ethnicity: p=0.802850599451517"
[1] "gene17 ~ ethnicity: p=0.756847943886486"
[1] "gene18 ~ ethnicity: p=0.431368157733262"
[1] "gene19 ~ ethnicity: p=0.780660298718924"
[1] "gene20 ~ ethnicity: p=0.819316198089084"

non-parmetric ANOVA (Kruskal-Wallis test) (slow; simple for loop):

baseformula <- " ~ ethnicity"
for (i in 2:ncol(merge)) {
    formula <- paste(colnames(merge)[i], baseformula, sep="")

    p <- kruskal.test(as.formula(formula), data=merge)$p.value

    print(paste(formula, ": p=", p, sep=""))
}

[1] "gene1 ~ ethnicity: p=0.329166862417036"
[1] "gene2 ~ ethnicity: p=0.761201522936653"
[1] "gene3 ~ ethnicity: p=0.812129687344178"
[1] "gene4 ~ ethnicity: p=0.553535953984283"
[1] "gene5 ~ ethnicity: p=0.692051268905425"
[1] "gene6 ~ ethnicity: p=0.77824469542416"
...
[1] "gene15 ~ ethnicity: p=0.729788874269058"
[1] "gene16 ~ ethnicity: p=0.894555285927434"
[1] "gene17 ~ ethnicity: p=0.759029765430655"
[1] "gene18 ~ ethnicity: p=0.178257916287501"
[1] "gene19 ~ ethnicity: p=0.634599044972864"
[1] "gene20 ~ ethnicity: p=0.852143788966208"

---------------------------------------------------------------

You have Batch and Gender in your data, too. If you want to adjust for those, you may instead consider multinomial logistic regression and including them as covariates, i.e., same as above but:

glm(ethnicity ~ gene1 + Batch + Gender, data=merge)

Kevin

ADD COMMENTlink modified 7 months ago • written 2.2 years ago by Kevin Blighe66k

Hi Kevin, thanks very much for the reply. I am able to follow up until the function for the parametric ANOVA. Here, it is giving me this error:

Error in levels(x)[x] : only 0's may be mixed with negative subscripts In addition: Warning messages: 1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored 2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors

The 'formula' variable at the very beginning of my loop is "X7896863 ~ Ethnic_group", for my first gene. I am not sure where the negative subscript may have come in. Do you have any suggestions on what the problem is?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Volka120
1

If you have hyphens in your gene names or your factor levels, then that will cause an error. You should change these to underscores ("_") or full-stops ("."), or something else. Your gene data should also be numeric. What happens when you run:

as.numeric(MyData$X7896863)

?

What about:

MyData$Ethnic_group

?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Kevin Blighe66k

Ah yes, MyData$Ethnic_group worked! Thank you!

ADD REPLYlink written 2.2 years ago by Volka120
1

Okay, great!

ADD REPLYlink written 2.2 years ago by Kevin Blighe66k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1517 users visited in the last hour