Question

fgsea with preranked data based on gene expression using DESeq2

1

3.9 years ago

jack.henry ▴ 50

I am trying to do some exploratory bioinformatics on TCGA data using fgsea.

Our lab looks at a specific gene, so I was trying to see whether high levels of this gene in TCGA expression data are correlated with enrichment of any gene sets. I have been preranking the data using DESeq2 (and using the F stat as a ranking) and was wondering how I should set up the design.

Because it is a continuous variable, I could either plug the scaled normalised counts for this gene straight into the DESeq2 design, or split the expression into low/high groups and then run DESeq2 to calculate the difference between low and high.
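The two encodings being compared can be sketched as follows. This is a minimal illustration in plain Python, not DESeq2 code: the expression values and the helper names `zscore` and `median_split` are hypothetical, and the log transform is an assumption about how the counts might be scaled. Either resulting vector would then be supplied as the sample-level covariate in the design.

```python
import math
import statistics


def zscore(values):
    """Centre and scale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def median_split(values):
    """Label a sample 'high' if it is strictly above the median, else 'low'."""
    cutoff = statistics.median(values)
    return ["high" if v > cutoff else "low" for v in values]


# Hypothetical normalised counts for the gene of interest, one per sample.
counts = [120, 340, 95, 810, 450, 60]
log_counts = [math.log2(c + 1) for c in counts]  # log scale reduces skew

continuous_covariate = zscore(log_counts)   # option 1: continuous design
group_covariate = median_split(log_counts)  # option 2: low/high design
```

The continuous covariate keeps the ordering and spacing of the samples, while the split keeps only which side of the median each sample falls on.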

I was wondering which of these (if either) is more acceptable? I assume using the continuous variable makes the most sense, but I have only seen other bioinformaticians do it by splitting the expression into two groups. Is the Wald test with DESeq2 the most appropriate tool for this?

I have run both methods using the hallmark gene sets and see very different rankings and similar but slightly different ES results. What are people's thoughts?

[Figure: NES from HALLMARK]

RNA-Seq DeSeq2 gsea fgsea R • 3.2k views

score 3 · Accepted Answer · 2020-06-11

3

3.9 years ago

dsull ★ 6.0k

Personally, I think the low/high stratification isn't ideal because you lose information about the expression of your gene of interest (you're collapsing everything into two values: low or high). I prefer the continuous design (edit: however, please see discussion below; important caveats).

An alternative approach would be to calculate the pairwise correlation between every gene and your gene of interest (using normalized count values); you can use the correlation coefficients as your ranking. Whether this is "better" than using the DESeq2 statistic, I don't know. There are many ways to analyze data, and the answer to what is "most acceptable" is not always clear or easy.
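A minimal sketch of that correlation-based ranking, in plain Python for illustration only. The gene names, the expression values, and the helpers `pearson` and `rank_by_correlation` are all hypothetical; in practice the input would be the normalized count matrix, and the ranked scores would be passed to fgsea as the named `stats` vector.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(expression, target):
    """Return (gene, r) pairs for every other gene, sorted from highest to lowest r."""
    target_values = expression[target]
    scores = {
        gene: pearson(values, target_values)
        for gene, values in expression.items()
        if gene != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized expression: one list of per-sample values per gene.
expression = {
    "GENE_OF_INTEREST": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with the gene of interest
    "GENE_B": [5.0, 4.1, 2.9, 2.2, 1.0],  # falls as the gene of interest rises
    "GENE_C": [3.0, 1.0, 3.0, 1.0, 3.0],  # no linear relationship
}

ranking = rank_by_correlation(expression, "GENE_OF_INTEREST")
```

Positively correlated genes land at the top of the ranking and negatively correlated genes at the bottom, which is the ordering a preranked enrichment analysis expects.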

2

I guess it depends a bit on the range of expression of that gene across the samples. If you use it as a continuous variable and it is poorly expressed in some samples but "off-the-chart/super high" in others, wouldn't a stratification make more sense, maybe low-middle-high?

Reply by ATpoint 82k, 3.9 years ago

2

Ah, yes, agreed. I would recommend looking at the distribution of expression of that gene and seeing whether distinct clusters exist.

Let's say there are six samples. In an extreme case for your gene of interest, three samples may have expression values 0.01, 0.02, 0.03 while three other samples may have expression values 100, 100.01, 100.02. The minor within-cluster differences of 0.01 aren't meaningful and might screw up a continuous variable analysis (especially something like a pairwise Spearman correlation). I would definitely recommend a stratification in this particular case.

On the other hand, if your expression values for your gene of interest are 10, 20, 30, 40, 50, 60 -- a stratification might not be such a great idea.
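The extreme six-sample case above can be checked numerically. The sketch below is plain Python for illustration; the second gene's values are hypothetical, chosen so that it is clearly higher in the "high" samples but its within-cluster noise runs in the opposite order. The rank helper assumes no tied values, which holds for this data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Gene of interest from the example: two tight clusters, low and high.
gene_of_interest = [0.01, 0.02, 0.03, 100.0, 100.01, 100.02]
# Hypothetical second gene that follows the clusters but not the noise.
other_gene = [5.3, 5.1, 5.2, 50.2, 50.1, 50.0]

rho = spearman(gene_of_interest, other_gene)  # about 0.6
r = pearson(gene_of_interest, other_gene)     # very close to 1
```

Because Spearman only sees ranks, the meaningless 0.01 differences within each cluster count as much as the jump between clusters and pull the coefficient down to about 0.6, even though the second gene separates the low and high samples perfectly. A low/high stratification would capture that separation directly.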

In my experience continuous variables generally work well, but there's no one-size-fits-all. A similar issue often comes up with survival analysis: do you run a Cox regression with gene expression as a continuous variable, or separate patients into high and low groups and show the two survival curves?

Reply by dsull ★ 6.0k, 3.9 years ago

1

That's a good point, thank you both. I see a pretty normally distributed expression, so I think I am going to proceed with a continuous design and see where it takes me.

[Figure: Histogram]

Reply by jack.henry ▴ 50, 3.9 years ago

0

Thanks a lot! I guess I don't mean the "most" but the "more" acceptable method (edited to change it). I only ask because I started using R just a month ago and was worried that what I was doing is not accepted in the bioinformatics community or something. It's reassuring to hear that I am thinking along the right lines!

Reply by jack.henry ▴ 50, 3.9 years ago