Question

Differential Gene Expression Analysis using data_RNA_Seq_v2_expression_median RSEM.Normalized

1

Entering edit mode

6.6 years ago

gokce.ouz ▴ 70

Hi All,

We have a specific gene mutation and we would like to learn how it is effective on Breast cancer.

So using the R, I get the mutation information from sequenced cases of TCGA Provisional and then stratified patients into two categories as Mutated & Wild Type. I downloaded the mRNA Expression z-Scores (RNA Seq V2 RSEM) from the cBioPortal website. I would like to look at the differentially expressed gene between these two groups but I have several questions :

The RNA seq data is Rsem.normalized, before I do any further analysis I transformed them into log2(rsem+1), that is correct right ?
For differential gene expression analysis what do you suggest me to use ? I cannot use DeSEQ2 or edgeR as they require raw counts as input.
I used limma package but I guess I get shows my data has some problem . Does it look ok or should I do something else ?

library(edgeR)
library(limma)
group = c( rep("Mut", 191), rep("WT", 660))
design <- model.matrix(~ 0 + group)
colnames(design) <- c("Mut", "WT")
y = TCGA_comb
par(mfrow=c(1,2))
v <- voom(y,design,plot = TRUE)
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(PIK3CA_mutVSwt=Mut - WT,levels=design)
fit.cont <- contrasts.fit(fit, cont.matrix)
fit.cont <- eBayes(fit.cont)
plotSA(fit.cont)
summa.fit <- decideTests(fit.cont)
tab <- topTable(fit.cont, n=Inf, coef="PIK3CA_mutVSwt")

Would it be too superficial if I calculate Fold Change, p-value & FDR on my own?

 a) Fold change: Take average of each gene per group and then Log2(B)-Log2(A)

 b) p-value: t.test command of R

 c) FDR: p.adjust(pvalue,method="fdr")

Many many thanks,

Gokce

RNA-Seq limma R DGE • 4.7k views

ADD COMMENT • link 6.6 years ago by gokce.ouz ▴ 70

score 3 · Accepted Answer · 2017-09-04

3

Entering edit mode

6.6 years ago

Kristoffer Vitting-Seerup ★ 4.0k

Hi Gokce

Answers below

1) Yes you can you the log-transformation as you describe - but you could also use a smaller pseudo-count (say 0.01) punishing smaller changes less.

2) Raw count data from TCGA's RNA_Seq_V2 are available via the CDC data portal or a most of other APIs such as TCGAbiolinks

3) Don't do that - limma uses normality as an assumption (just like a t.test) - so if you dont trust the limma results you should not trust a t.test. Use voom+limma with raw counts obtained as in 2) instead (limma instead of edgeR since they perform almost identical when you have many replicates (which you have right?) - expect that limma is much much faster.

Hope this helps

ADD COMMENT • link 6.6 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Dear Kristoffer,

Thanks a lot for your answer.

I knew I would be more comfortable with the raw data however I was being forced to work on this normalized data. Maybe it is better to insist on working on the raw data again.

So to clarify, you are suggesting me to use voom+limma on the raw count data instead of edgeR & DESeq2, right ?

ADD REPLY • link 6.6 years ago by gokce.ouz ▴ 70

2

Entering edit mode

Yes i suggest voom-limma because when you have many replicates (which you do in the TCGA data) they perform similar in benchmark. The difference then becomes that instead of waiting hours on edgeR or DESeq2 you can wait seconds/minuts on limma allowing much greater flexibility.