Question: RNA Seq data with GSEA
0
gravatar for pennakiza
13 months ago by
pennakiza60
pennakiza60 wrote:

Hi all,

I am a bit confused with how I should be transforming my data to use it with GSEA. I have tried two different approaches and I am not sure which one is the best.

First, I have created a DGElist, filtered my low count reads and transformed it with voom. These data I fed then to GSEA.

y <- DGEList(counts = fc_mydata$counts, genes=fc_mydata$annotation[,c("GeneID","Length")])
keep<-filterByExpr(y)
y<-y[keep,]
y<-calcNormFactors(y, method="TMM")
y_voom<-voom(y) 
GSEA_table<-y_voom$E
colnames(GSEA_table)<-pastemydata$Treatment[match(colnames(GSEA_table),mydata$Sample)],mydata$DaysToRebound,mydata$SampleID,sep="_")
 GSEA_table<-GSEA_table[,order(colnames(GSEA_table))]
 symb<-annotation$Symbol[match(rownames(GSEA_table),annotation$GeneID)]
 write.table(file="GSEA_table_final.txt",cbind("NAME"=symb,GSEA_table),row.names = F,quote = F,sep = "\t")

Second approach, I ran DESeq2 on raw counts, transformed it with vsn and fed that to GSEA.

dds<-DESeqDataSetFromMatrix(countData=fc_mydata$counts, colData=mydata, ~Treatment)
dds<-DESeq(dds)
res<-results(dds)
res_ordered<-res[order(res$padj),]
res_ordered<- res_ordered[order(res_ordered$padj, decreasing = F),]
rownames(res_ordered)<-make.names(annotation$Symbol[match(rownames(res_ordered), annotation$GeneID)], unique=TRUE)
head(res_ordered)
write.table(res_ordered, "res_ordered",sep="\t")

vst<-vst(dds, blind=F)
normalised_vst<-assay(vst)
rownames(normalised_vst)<-make.names(annotation$Symbol[match(rownames(normalised_vst), annotation$GeneID)], unique=TRUE)
normalised_vst<-normalised_vst[,order(colnames(normalised_vst))]
write.table(normalised_vst, "normalised_vst.txt", cbind("NAME" = rownames(normalised_vst), normalised_vst),row.names = F,quote = F,sep = "\t")

What do you think, which is the most correct one? What can I do to improve?

Thanks a lot!

gsea rna-seq vsn voom • 795 views
ADD COMMENTlink modified 13 months ago by Gordon Smyth2.3k • written 13 months ago by pennakiza60
2
gravatar for Gordon Smyth
13 months ago by
Gordon Smyth2.3k
Australia
Gordon Smyth2.3k wrote:

I am the voom author, but I do not recommend voom for this purpose because GSEA is unable to use the voom precision weights. Just use:

y <- DGEList(counts ...)
y <- calcNormFactors(y)
GSEA_table <- cpm(y, log=TRUE)

You do not need filterByExpr here but, if you do use it, it should be used in conjunction with your treatment factor or design matrix.

As an alternative to GSEA, you might also consider the pathway analysis functions such as camera that come as part of limma.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Gordon Smyth2.3k

Thank you Gordon! I will try camera too!

ADD REPLYlink written 13 months ago by pennakiza60
0
gravatar for dsull
13 months ago by
dsull1.8k
UCLA
dsull1.8k wrote:

Look at the instructions here:

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA

Using DESeq2 is fine (and other types of acceptable RNA-seq normalizations should also be fine).

You also may not need to vst-transform your normalized counts, per the following FAQ: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Should_I_use_natural_or_log_scale_data_for_GSEA.3F

DESeq2 also gives you the log fold changes (treatment compared to control) of each gene. You can just use the log fold changes to rank your gene list and then feed that ranked gene list into GSEAPreranked algorithm. This is what I generally do. If you do this, I'd recommend using lfcShrink to get more accurate log fold change values.

ADD COMMENTlink modified 13 months ago • written 13 months ago by dsull1.8k

Thank you very much!

I thought that you needed to transform the data before but I will certainly try it without it. I have tried to used preranked but in some cases I have only a few DE genes (p-adj <0.05) and if I only use those I don't get any gene sets enriched.

ADD REPLYlink written 13 months ago by pennakiza60

Ah, for the prerank algorithm, you still use ALL your genes, not just your DE ones. So if 200 out of 40000 genes are DE, you still use all 40000 genes. Basically, you just supply a list of ALL genes ranked by their log fold change.

That's the power of GSEA -- you use every single gene and it's not based off of cutoffs/thresholds. Contrast this to gene ontology overrepresentation analysis where you only input a list of genes that are considered DE.

ADD REPLYlink modified 13 months ago • written 13 months ago by dsull1.8k

That's cool, in this case I am getting the same results, which is reassuring!

Thank you very much for your help!

ADD REPLYlink written 13 months ago by pennakiza60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1606 users visited in the last hour
_