Question: RNA Seq data with GSEA
0
gravatar for pennakiza
28 days ago by
pennakiza50
pennakiza50 wrote:

Hi all,

I am a bit confused with how I should be transforming my data to use it with GSEA. I have tried two different approaches and I am not sure which one is the best.

First, I have created a DGElist, filtered my low count reads and transformed it with voom. These data I fed then to GSEA.

y <- DGEList(counts = fc_mydata$counts, genes=fc_mydata$annotation[,c("GeneID","Length")])
keep<-filterByExpr(y)
y<-y[keep,]
y<-calcNormFactors(y, method="TMM")
y_voom<-voom(y) 
GSEA_table<-y_voom$E
colnames(GSEA_table)<-pastemydata$Treatment[match(colnames(GSEA_table),mydata$Sample)],mydata$DaysToRebound,mydata$SampleID,sep="_")
 GSEA_table<-GSEA_table[,order(colnames(GSEA_table))]
 symb<-annotation$Symbol[match(rownames(GSEA_table),annotation$GeneID)]
 write.table(file="GSEA_table_final.txt",cbind("NAME"=symb,GSEA_table),row.names = F,quote = F,sep = "\t")

Second approach, I ran DESeq2 on raw counts, transformed it with vsn and fed that to GSEA.

dds<-DESeqDataSetFromMatrix(countData=fc_mydata$counts, colData=mydata, ~Treatment)
dds<-DESeq(dds)
res<-results(dds)
res_ordered<-res[order(res$padj),]
res_ordered<- res_ordered[order(res_ordered$padj, decreasing = F),]
rownames(res_ordered)<-make.names(annotation$Symbol[match(rownames(res_ordered), annotation$GeneID)], unique=TRUE)
head(res_ordered)
write.table(res_ordered, "res_ordered",sep="\t")

vst<-vst(dds, blind=F)
normalised_vst<-assay(vst)
rownames(normalised_vst)<-make.names(annotation$Symbol[match(rownames(normalised_vst), annotation$GeneID)], unique=TRUE)
normalised_vst<-normalised_vst[,order(colnames(normalised_vst))]
write.table(normalised_vst, "normalised_vst.txt", cbind("NAME" = rownames(normalised_vst), normalised_vst),row.names = F,quote = F,sep = "\t")

What do you think, which is the most correct one? What can I do to improve?

Thanks a lot!

gsea rna-seq vsn voom • 177 views
ADD COMMENTlink modified 28 days ago by Gordon Smyth1.4k • written 28 days ago by pennakiza50
2
gravatar for Gordon Smyth
28 days ago by
Gordon Smyth1.4k
Australia
Gordon Smyth1.4k wrote:

I am the voom author, but I do not recommend voom for this purpose because GSEA is unable to use the voom precision weights. Just use:

y <- DGEList(counts ...)
y <- calcNormFactors(y)
GSEA_table <- cpm(y, log=TRUE)

You do not need filterByExpr here but, if you do use it, it should be used in conjunction with your treatment factor or design matrix.

As an alternative to GSEA, you might also consider the pathway analysis functions such as camera that come as part of limma.

ADD COMMENTlink modified 28 days ago • written 28 days ago by Gordon Smyth1.4k

Thank you Gordon! I will try camera too!

ADD REPLYlink written 27 days ago by pennakiza50
0
gravatar for dsull
28 days ago by
dsull900
UCLA
dsull900 wrote:

Look at the instructions here:

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA

Using DESeq2 is fine (and other types of acceptable RNA-seq normalizations should also be fine).

You also may not need to vst-transform your normalized counts, per the following FAQ: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Should_I_use_natural_or_log_scale_data_for_GSEA.3F

DESeq2 also gives you the log fold changes (treatment compared to control) of each gene. You can just use the log fold changes to rank your gene list and then feed that ranked gene list into GSEAPreranked algorithm. This is what I generally do. If you do this, I'd recommend using lfcShrink to get more accurate log fold change values.

ADD COMMENTlink modified 28 days ago • written 28 days ago by dsull900

Thank you very much!

I thought that you needed to transform the data before but I will certainly try it without it. I have tried to used preranked but in some cases I have only a few DE genes (p-adj <0.05) and if I only use those I don't get any gene sets enriched.

ADD REPLYlink written 27 days ago by pennakiza50

Ah, for the prerank algorithm, you still use ALL your genes, not just your DE ones. So if 200 out of 40000 genes are DE, you still use all 40000 genes. Basically, you just supply a list of ALL genes ranked by their log fold change.

That's the power of GSEA -- you use every single gene and it's not based off of cutoffs/thresholds. Contrast this to gene ontology overrepresentation analysis where you only input a list of genes that are considered DE.

ADD REPLYlink modified 27 days ago • written 27 days ago by dsull900

That's cool, in this case I am getting the same results, which is reassuring!

Thank you very much for your help!

ADD REPLYlink written 26 days ago by pennakiza50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 810 users visited in the last hour