Question

Survival plot between low and high expression of gene

5.7 years ago

Biologist ▴ 290

Hi,

I wanted to make a survival plot comparing samples with low and high expression of a gene. I followed this cutpoint approach, which uses the maxstat package to divide samples into low and high groups. That tutorial used RSEM-normalized gene expression data.

I have raw counts from featureCounts, and I also have RPKM data.

First, I used the RPKM data and plotted the survival curves: [figure: survival plot between low and high with RPKM expression]. This gave a p-value of 0.026.

Second, I used normalized counts (raw counts normalized with DESeq2) and plotted the survival curves: [figure: survival plot between low and high with normalized counts]. This gave a p-value of 0.1.

Both plots show the same pattern with no visible change, so why are the p-values so different? With RPKM the result is significant, but with normalized counts it is not. What could be the reason?

Which units of gene expression data should I use to divide samples into low and high?
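
For readers unfamiliar with the cutpoint approach mentioned above, here is a naive sketch of the underlying idea: try several candidate cutpoints and keep the one with the largest log-rank statistic. This is not the maxstat implementation; the data frame `d`, its values, and the helper `chisq_at` are hypothetical, and only the `survival` package is assumed.

```r
library(survival)

# Hypothetical toy data: survival time, event status (1 = event), expression
d <- data.frame(
  time   = c(2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20),
  status = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1),
  expr   = c(9.1, 8.0, 7.5, 7.0, 6.1, 5.2, 4.8, 4.0, 3.3, 2.5, 1.9, 1.0)
)

# Candidate cutpoints: midpoints between sorted expression values,
# restricted so that each group keeps at least 3 samples
x <- sort(d$expr)
mids <- (head(x, -1) + tail(x, -1)) / 2
candidates <- mids[3:(length(mids) - 2)]

# Log-rank chi-square statistic for a given cutpoint
chisq_at <- function(cut) {
  grp <- d$expr > cut
  survdiff(Surv(time, status) ~ grp, data = d)$chisq
}

stats <- sapply(candidates, chisq_at)
best_cut <- candidates[which.max(stats)]
```

Note that a p-value computed at the selected cutpoint without correction is optimistic, because the cutpoint was chosen to maximize the statistic; maxstat provides p-value approximations that account for this.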

RNA-Seq r survival geneexpression • 3.2k views

Updated 5.7 years ago by Santosh Anand 5.7k • written 5.7 years ago by Biologist ▴ 290

score 2 · Answer 1 · 2018-08-15

5.7 years ago

Devon Ryan 104k

But there is a very important difference between the plots, namely that the "low" values in the bottom plot are MUCH closer to the "high" values. This is why the p-values differ. You can see this in the "Strata" plot, where there's a constant difference of 1 between the top and bottom sets of plots.

Oh yes, thank you. What could be the reason for that? Is it because of the different expression data?

And what would you recommend for dividing samples into low and high based on expression: normalized counts, RPKM, FPKM, or something else?

5.7 years ago by Biologist ▴ 290

For a single gene it won't matter, unless you have isoform switching or something like that. If your gene-level metric is a summary of transcript-level metrics then TPM is going to be the most useful.

5.7 years ago by Devon Ryan 104k

Hi Devon,

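A small question: can TPM values converted from raw featureCounts counts be used for this analysis? I used the following function for the conversion.

# Convert raw counts to TPM for a single sample.
# counts: read counts per gene; lengths: gene lengths (same order as counts).
tpm <- function(counts, lengths) {
  rate <- counts / lengths   # length-normalized read rate per gene
  rate / sum(rate) * 1e6     # rescale so the sample sums to one million
}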

5.7 years ago by Biologist ▴ 290

That looks right at least.
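
One caveat worth checking: the function normalizes by `sum(rate)`, so it is only correct when applied to one sample at a time. A minimal sketch, assuming a hypothetical genes-by-samples count matrix `counts` and a hypothetical vector of gene lengths `lengths`, applies it column by column:

```r
# Same TPM function as in the thread, repeated so the block is self-contained
tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

# Hypothetical count matrix: rows are genes, columns are samples
counts <- matrix(
  c(10, 20, 30,
     5, 50, 45),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
lengths <- c(1000, 2000, 500)  # hypothetical gene lengths in bp

# Apply per sample (per column), so each sample is scaled independently
tpm_mat <- apply(counts, 2, tpm, lengths = lengths)

# Each sample should sum to one million
colSums(tpm_mat)
```

Calling `tpm(counts, lengths)` directly on the whole matrix would divide by the sum over all samples, so the per-sample totals would no longer be one million.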

5.7 years ago by Devon Ryan 104k

score 1 · Answer 2 · 2018-08-15

The curves look slightly different because the maxstat algorithm assigns 18 samples to the low group in the first case, but 20 samples in the second. This means that the fraction of samples surviving in that group is higher at most of the event points in the second case, which moves the blue curve up a little and brings it closer to the yellow one. Curves that are closer together give a higher (less significant) p-value, which is what you see with the normalized counts.
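
To see how group membership alone can change the log-rank p-value, here is a minimal sketch using the `survival` package. The data frame `d`, the two cutpoints, and the helper `logrank_p` are hypothetical and only illustrate the mechanism; they are not the data from the question.

```r
library(survival)

# Hypothetical toy data: 10 samples, all events observed
d <- data.frame(
  time   = c(2, 3, 4, 5, 6, 8, 10, 12, 14, 16),
  status = rep(1, 10),
  expr   = c(9, 8, 7.5, 7, 5.2, 4.8, 4, 3, 2, 1)
)

# Log-rank p-value for a low/high split at a given expression cutpoint
logrank_p <- function(cutpoint, data) {
  data$group <- ifelse(data$expr > cutpoint, "high", "low")
  fit <- survdiff(Surv(time, status) ~ group, data = data)
  pchisq(fit$chisq, df = 1, lower.tail = FALSE)
}

p_a <- logrank_p(5.0, d)  # 5 samples high, 5 samples low
p_b <- logrank_p(7.2, d)  # 3 samples high, 7 samples low
c(p_a = p_a, p_b = p_b)
```

The survival times are identical in both calls; only the assignment of samples to groups changes, and the p-value changes with it.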

"And what would you recommend to use for dividing samples into low and high based on expression, normalized counts or rpkm? or fpkm or any other?"

If your choice of count method gives different results, then the right question to ask is whether the results are robust. In my opinion, they are not. There may also be insufficient power, perhaps because you are splitting low and high at a single thin boundary, which blurs the distinction between the two groups. You could try three categories, low | medium | high, and check whether the low vs high comparison is robust across all of the count methods. Robustness is more important than any particular method because all of them are essentially measuring the same thing.
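
The low | medium | high suggestion can be sketched in base R as follows. Tertiles are one possible choice of boundaries, not something this answer prescribes; the vectors `rpkm` and `norm_counts` and the helper `tertile_groups` are hypothetical.

```r
# Split a numeric vector into low / medium / high by tertiles
tertile_groups <- function(x) {
  breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
  cut(x, breaks = breaks, labels = c("low", "medium", "high"),
      include.lowest = TRUE)
}

# Hypothetical expression of one gene in the same 12 samples, two units
rpkm        <- c(1.2, 3.4, 0.8, 5.6, 2.2, 7.1, 4.0, 0.5, 6.3, 2.9, 3.8, 8.4)
norm_counts <- c(30, 85, 22, 140, 60, 170, 95, 15, 150, 70, 100, 210)

g_rpkm <- tertile_groups(rpkm)
g_norm <- tertile_groups(norm_counts)

# Cross-tabulate the assignments: off-diagonal entries are samples
# whose group depends on the expression unit
agreement <- table(rpkm = g_rpkm, norm_counts = g_norm)

# Samples assigned to the same extreme group (low or high) by both units
keep <- g_rpkm == g_norm & g_rpkm != "medium"
```

The survival comparison can then be restricted to the `keep` samples, and the result checked for consistency across units.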