Question

Differentially expressed genes between sensitive and non sensitive with a simple t-test with FDR control

0

Entering edit mode

5.7 years ago

renz.karlsruhe • 0

Hello everyone,

I want to find differentially expressed genes between sensitive and non-sensitive cells using a simple T-test and FDR control.

To generate the expression data for me:

library(affy)
library(qvalue)

cell.lines<-read.csv("sensitivity.csv",row.names=1,header=TRUE,sep=",")
pheno<-as.data.frame(cell.lines[,2],row.names=row.names(cell.lines))
names(pheno)<-c("sensitivity")
pheno<-new("AnnotatedDataFrame", data = pheno)

wangData<-ReadAffy(phenoData=pheno)
wangExpr<-rma(wangData)
sampleNames(wangExpr) 
featureNames(wangExpr) 
description(wangExpr)
annotation(wangData)
pData(phenoData(wangData))
dim(exprs(wangExpr)) 
head(exprs(wangExpr))
x<-exprs(wangExpr)

then I got as an example the following records:

1. cell.lines:
            celline     drugsens
GSM243480   PC3     s
GSM243481   DU145   s
GSM243482   LNCaP   s
GSM243483   22Rv    r

where s = sensitive and r = recurrent

2. x:
            GSM243480.CEL       GSM243481.CEL       GSM243482.CEL       GSM243483.CEL
1007_s_at   8.947219            9.126816            9.061964            9.571040
1053_at     8.892565            8.663903            8.610118            10.069438
117_at      4.657418            4.589023            4.437430            4.574685
121_at      7.232597            7.349899            7.597990            8.310655
1255_g_at   3.335694            3.359144            3.351804            3.201163

the columns are the cells and the lines the genes.

I've been thinking that I can write me a function that computes me the test:

A wrapper function to perform a t-test for all genes

my.t.test <- function(x, group, alternative = "two.sided"){
  x1   <- t(x[group[,1]])
  x2   <- t(x[group[,2]])
  tres <-  t.test(x = x1, y = x2, alternative = alternative)
  fc   <-  mean(x1) - mean(x2)
  stat <- tres$statistic
  if(alternative == "two.sided"){
    stat <- abs(stat)
  }

  res<-matrix(c(stat, tres$p.value, fc), ncol=3)
  colnames(res)<-c("statistic","p.value","fc")
  return(res)
}

test sensitiv versus non sensitiv genes:

sVr<-cbind(cell.lines$drugsens=="s",cell.lines$drugsens=="r")
colnames(sVr)<-c("sensitiv", "rezesiv")
SensitivVersusRezesiv <- apply(X=x, MARGIN=2, FUN=my.t.test, group=sVr)
SensitivVersusRezesiv <- as.data.frame(t(SensitivVersusRezesiv))
colnames(SensitivVersusRezesiv)<-c("statistic","p.value","fc")
SensitivVersusRezesiv <- cbind(t(x[0,]),SensitivVersusRezesiv)

The problem is that I now only get one record, with the statistics and p-values for one cell type, but not for the genes. At the moment I can not go any further. Do you happen to have an idea how to solve this better?

Thank you for your help!

R gene • 2.4k views

ADD COMMENT • link 5.7 years ago by renz.karlsruhe • 0

2

Entering edit mode

please tell me you've got more than one resistant cell line; and please learn limma https://bioconductor.org/packages/release/bioc/html/limma.html

ADD REPLY • link 5.7 years ago by russhh 5.7k

1

Entering edit mode

I want to find differentially expressed genes between sensitive and non-sensitive cells using a simple T-test and FDR control.

I don't think this is a good idea. There are advanced statistical frameworks for differential expression analysis.

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes, but that should be our job. Do you know how to get it out with the t-test?

enter image description here https://ibb.co/eO9BVe

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

1

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Right. Sorry, I just did not pay attention. I will pay attention to it now.

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

0

Entering edit mode

You will also find this useful: How to add images to a Biostars post

ADD REPLY • link 5.7 years ago by GenoMax 142k

0

Entering edit mode

Using a straight t-test is a fools errand. Learn how to do it using moderated t-tests in limma

ADD REPLY • link 5.7 years ago by russhh 5.7k

0

Entering edit mode

But I can not use that. In the picture I have attached, it says that I should do it only with a t-test first. For me it does not work at the moment. Is "limma" the same?

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

0

Entering edit mode

So this is an assignment?

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes, I have more than one. This should only be an excerpt.

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

score 2 · Answer 1 · 2018-08-08

2

Entering edit mode

5.7 years ago

ewre ▴ 250

For the code:

First, if I am correct, you set wrong MARGIN parameter in apply function, should be 1 instead of 2. you want to calculate across genes.

Second, according to the help page of t.test, x1 and x2 should be vector instead of matrix.

Third, better to make my.t.test return a vector instead of a matrix.

For differential expression analysis of microarray data:

Limma and RankProd are good choices if you want.

ADD COMMENT • link 5.7 years ago by ewre ▴ 250

0

Entering edit mode

Good morning, now I've improved it:

Thank you for your tips and help, then I will try it with Limma. As far as I know, x1 and x2 are already a vector. I'm trying to compare the RIGHT and FALSE.

A wrapper function to perform a t-test for all genes
my.t.test <- function(y, group, alternative = "two.sided"){
  "x is a column vector of observations, take has two columns with logical values true or false"
  x1   <- t(y[group[,1]])
  x2   <- t(y[group[,2]])
  tres <-  t.test(x = x1, y = x2, alternative = alternative)
  fc   <-  mean(x1) - mean(x2)
  stat <- tres$statistic
  if(alternative == "two.sided"){
    stat <- abs(stat)
  }
  res<-matrix(c(stat, tres$p.value, fc), ncol=3)
  colnames(res)<-c("statistic","p.value","fc")
  return(res)
}

x <- exprs(wangExpr)
geneNames <- featureNames(wangExpr)
daten=list(x=x, geneid=as.character(1:nrow(x)) ,genename=geneNames , logged2=TRUE)

test sensitiv versus non sensitiv genes:
sVr<-cbind(cell.lines$drugsens=="s",cell.lines$drugsens=="r")
colnames(sVr)<-c("sensitiv", "rezesiv")
SensitivVersusRezesiv <- apply(X=x, MARGIN=1, FUN=my.t.test, group=sVr)
SensitivVersusRezesiv <- as.data.frame(t(SensitivVersusRezesiv))
colnames(SensitivVersusRezesiv)<-c("statistic","p.value","fc")

hist(SensitivVersusRezesiv$p.value, xlab = "pvalue",breaks=100)
SensitivVersusRezesiv <- SensitivVersusRezesiv[order(
  abs(SensitivVersusRezesiv$p.value), decreasing=FALSE), ]

calculate q-values:
q.val <- qvalue(SensitivVersusRezesiv$p.value)
SensitivVersusRezesiv <- cbind.data.frame(SensitivVersusRezesiv, q.val$qvalues)
names(SensitivVersusRezesiv)<-c("statistic","p.value","fc", "q.value")

plot(SensitivVersusRezesiv$p.value, q.val$qvalues)
plot(q.val)
hist(SensitivVersusRezesiv$fc, breaks=100)
plot(SensitivVersusRezesiv$fc, -log(SensitivVersusRezesiv$p.value,base = 10),
     xlab="fold change", ylab="-log10(p-value)")

computing you own FDRs using an FDR simulation:
permutDist <- function(x, group, observedStatistics){
  group<-group[sample(1:nrow(group), replace = FALSE), ]
  permStat <- apply(X=x, MARGIN=2, FUN=my.t.test, group=group)
  permStat <-as.data.frame(t(permStat))
  colnames(permStat)<-c("statistic","p.value","fc")
  #print(permStat)
  nGreater <- sapply(observedStatistics,
                     FUN = function(obs, perm){
                       return(sum(perm>obs))
                     }, permStat$statistic)
  return(nGreater)
}

SensitivVersusRezesiv <- SensitivVersusRezesiv[order(
  abs(SensitivVersusRezesiv$statistic), decreasing=TRUE), ]

B<-60
nGreaterThanObserved <- permutDist(x,sVr, SensitivVersusRezesiv$statistic)
for(bin in 2:B){
  nGreaterThanObserved <- nGreaterThanObserved  +
    permutDist(x,sVr, SensitivVersusRezesiv$statistic)
}
nGreaterThanObserved <- nGreaterThanObserved/B
FDR <- nGreaterThanObserved/rank(nGreaterThanObserved)
SensitivVersusRezesiv<- cbind.data.frame(SensitivVersusRezesiv, FDR)

With this I got the following table:

            statistic   p.value         fc          q.value     FDR
204855_at   9.587725    1.773681e-06    6.7472272   0.003023996 0
205239_at   8.533018    4.325952e-06    4.9501956   0.004603323 0
209270_at   8.330318    1.651930e-06    4.0318072   0.003023996 0
211667_x_at 8.320601    1.956686e-06    -0.4831631  0.003023996 0
200623_s_at 8.155573    6.163027e-05    -1.1984421  0.011102301 0
205070_at   8.139732    1.393233e-06    -1.4371348  0.003023996 0
210150_s_at 8.038277    1.351106e-06    1.7837117   0.003023996 0
210074_at   7.901260    1.655555e-06    4.0735272   0.003023996 0
37943_at    7.890438    2.009131e-06    1.1689482   0.003023996 0
209016_s_at 7.741336    3.320958e-06    6.3220215   0.003926546 0
210424_s_at 7.700025    2.273429e-06    1.1350697   0.003023996 0
204989_s_at 7.507732    9.725657e-06    2.9636502   0.005872305 0
201474_s_at 7.313394    4.823997e-06    2.7317807   0.004666638 0
206685_at   7.204983    6.409281e-06    1.1157027   0.005246332 0
203313_s_at 7.197138    1.899697e-04    1.9678139   0.014863984 0
206295_at   7.126566    9.389864e-06    4.8828937   0.005872305 0
213073_at   7.118477    7.881381e-06    1.4589036   0.005872305 0
206884_s_at 7.093565    3.307580e-05    5.0306353   0.009664292 0
201684_s_at 7.089161    7.378596e-05    0.7150009   0.011379270 0
214908_s_at 7.044968    5.965220e-06    -1.0282009  0.005246332 0

These are the top 20 genes, sorted by the t-statistic. I'm confused now, why the FDR = 0 is. But otherwise it looks good?

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

1

Entering edit mode

Hi, regarding your comments: 1. x1 <- t(y[group[,1]]) will make x1 as a matrix with one row and n columns. Although in your situation, the resulting t test pvalue is not affected by this, but it's dangerous like a time bomb you never know when it will explode. 2. for FDR correction part, two things: 1). not sure if you need to correct MARGIN parameter , again. 2) since you only have 4 samples, I am really concerning about your permutation based strategy. Also, permutation based strategy for false discovery control is really a complex topic, at least in my opinion. Better to use an existing functions like p.adjust.

ADD REPLY • link 5.7 years ago by ewre ▴ 250

0

Entering edit mode

Thank you for your help. It worked well with p-adjust and went amazingly fast.

The code for the FDR:

SensitivVersusRezesiv <- SensitivVersusRezesiv[order(
  abs(SensitivVersusRezesiv$statistic), decreasing=TRUE), ]

FDR = p.adjust(SensitivVersusRezesiv$p.value, method = "fdr")
SensitivVersusRezesiv<- cbind.data.frame(SensitivVersusRezesiv, FDR)


plot(SensitivVersusRezesiv$FDR, SensitivVersusRezesiv$q.value)
fit<-lm(q.value~FDR, data=SensitivVersusRezesiv)
abline(fit)
SensitivVersusRezesiv2 = SensitivVersusRezesiv[1:10, 0:5]

Gensymbol = select(hgu133a2.db, keys = xAnnot$PROBEID , columns = "SYMBOL")
SensitivVersusRezesiv2 <-cbind(Gensymbol$SYMBOL[match(rownames(SensitivVersusRezesiv2),Gensymbol$PROBEID) ],SensitivVersusRezesiv2)
names(SensitivVersusRezesiv2)<-c("geneSymbol","statistic","p.value","fc","q.value","FDR")

The corresponding table of the 10 best:

          geneSymbol    statistic   p.value         fc          q.value     FDR
204855_at   SERPINB5    9.587725    1.773681e-06    6.7472272   0.003023996 0.006330647
205239_at   AREG        8.533018    4.325952e-06    4.9501956   0.004603323 0.009636924
209270_at   LAMB3       8.330318    1.651930e-06    4.0318072   0.003023996 0.006330647
211667_x_at TRAV12-2    8.320601    1.956686e-06    -0.4831631  0.003023996 0.006330647
200623_s_at CALM3       8.155573    6.163027e-05    -1.1984421  0.011102301 0.023242345
205070_at   ING3        8.139732    1.393233e-06    -1.4371348  0.003023996 0.006330647
210150_s_at LAMA5       8.038277    1.351106e-06    1.7837117   0.003023996 0.006330647
210074_at   CTSV        7.901260    1.655555e-06    4.0735272   0.003023996 0.006330647
37943_at    ZFYVE26     7.890438    2.009131e-06    1.1689482   0.003023996 0.006330647
209016_s_at KRT7        7.741336    3.320958e-06    6.3220215   0.003926546 0.008220110

Now it looks better. But I am still surprised, as stated above, that some p-values are larger, even though they have a strong test statistic and why are some FDRs the same? What could that be? Can I somehow test that?

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

1

Entering edit mode

Better to have a look at BH method of FDR correction. I tried to understand that but every time after I forgot it quickly :). By the way, here is a related question you may have a glance.

ADD REPLY • link 5.7 years ago by ewre ▴ 250

0

Entering edit mode

Thanks for the tip. I have already tried it with bra, but had made no change. :( The link has helped me, because probably the most important is: "... there is never a reason to set a lower threshold just to get a higher FDR."

Accordingly, the FDR must then fit. I did not choose "BH" because I was afraid that this would be too conservative. I think that it finally fits. I think I finally understood it :). Many thanks for your help!

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0

0

Entering edit mode

your FDR was calculated based on random sampling; the reason you have zeros is because in none of the randomly-sampled runs was the sampled statistic greater than the observed value. You'd need 10000s of samples to get values greater than zero for your most-significant hits

Do you know why your t-statistic for 200623_s_at is higher than that for 205070_at, but it's p-value isn't lower?

ADD REPLY • link 5.7 years ago by russhh 5.7k

0

Entering edit mode

Okay thanks. I have a total of 22277 genes, of which only 15316 have an FDR value. Does my code agree roughly, or did I calculate the FDR incorrectly? No I do not know that. There is still a mistake somewhere. I just can not find. I've improved it once more and selected the best 10 genes:

            statistic   p.value         fc          q.value     FDR
210150_s_at 8.038277    1.351106e-06    1.7837117   0.003023996 0
205070_at   8.139732    1.393233e-06    -1.4371348  0.003023996 0
209270_at   8.330318    1.651930e-06    4.0318072   0.003023996 0
210074_at   7.901260    1.655555e-06    4.0735272   0.003023996 0
204855_at   9.587725    1.773681e-06    6.7472272   0.003023996 0
211667_x_at 8.320601    1.956686e-06    -0.4831631  0.003023996 0
37943_at    7.890438    2.009131e-06    1.1689482   0.003023996 0
210424_s_at 7.700025    2.273429e-06    1.1350697   0.003023996 0
209016_s_at 7.741336    3.320958e-06    6.3220215   0.003926546 0
205239_at   8.533018    4.325952e-06    4.9501956   0.004603323 0

ADD REPLY • link 5.7 years ago by renz.karlsruhe • 0