Question

RNA-seq based gene signature analysis

5.6 years ago

gspirito ▴ 10

Hello everyone, here's my question:

I have three datasets, each with RNA-seq and WES data from cases and controls of the same disease. After stratifying the cases in each dataset by the presence or absence of a deleterious mutation in a particular gene, I obtain three groups per dataset: Controls, Cases_1 and Cases_2. I then perform differential expression analyses for Cases_1 vs Controls and Cases_2 vs Controls in each dataset separately. If I select the genes that are differentially expressed in all three "Cases_2 vs Controls" analyses but not in all three "Cases_1 vs Controls" analyses, can I consider them a signature for the Cases_2 group? Or are additional analyses needed?

Thanks in advance to anyone who will answer.

RNA-Seq • 2.0k views

updated 5.6 years ago by Kevin Blighe 90k • written 5.6 years ago by gspirito ▴ 10

score 2 · Accepted Answer · 2020-04-28

can I consider those as a signature for the Cases_2 group?

Yes and no. It is 'tentative' evidence but, for now, it only says that you have identified a group of genes whose expression, relative to controls, differs between Cases1 and Cases2. You will have to build more evidence to convince people that you have a 'signature'. And what would it be a signature of: changes in expression due to the mutation? Also, what thresholds did you use to call a gene statistically significantly differentially expressed?
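For reference, the selection described in the question boils down to set operations on the per-dataset lists of differentially expressed genes. The sketch below is only an illustration: the gene names and the `cases1_de` / `cases2_de` lists are hypothetical, and it shows two possible readings of "not found in all three Cases_1 comparisons", since the question leaves that open.

```r
# Hypothetical DE gene lists, one character vector per dataset (made-up names)
cases2_de <- list(
  ds1 = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  ds2 = c("GENE_A", "GENE_B", "GENE_C"),
  ds3 = c("GENE_A", "GENE_B", "GENE_C", "GENE_E")
)
cases1_de <- list(
  ds1 = c("GENE_A", "GENE_F"),
  ds2 = c("GENE_A", "GENE_B"),
  ds3 = c("GENE_A")
)

# Genes DE in every Cases_2 vs Controls comparison
in_all_cases2 <- Reduce(intersect, cases2_de)

# Genes DE in every Cases_1 vs Controls comparison, and in at least one
in_all_cases1 <- Reduce(intersect, cases1_de)
in_any_cases1 <- Reduce(union, cases1_de)

# Lenient reading: drop genes that are DE in all three Cases_1 comparisons
candidates_lenient <- setdiff(in_all_cases2, in_all_cases1)

# Strict reading: drop genes that are DE in any Cases_1 comparison
candidates_strict <- setdiff(in_all_cases2, in_any_cases1)

print(candidates_lenient)
print(candidates_strict)
```

The two readings can give quite different candidate lists, so it is worth stating which one was used alongside the significance thresholds.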

It would also be good to see a direct Cases1 vs Cases2 comparison, purely out of interest.

You should also construct a binomial logistic regression model in order to build further evidence. This would be of the form:

glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))

Here, mutation would be encoded as 1 or 0, gene would be a continuous variable of gene expression, and mydata would contain Cases1 and Cases2 combined. This would provide more convincing evidence that a gene's expression is associated with mutation status, but still not direct evidence that the mutation is the cause.
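As a rough illustration of the logistic model, here is a minimal sketch on simulated data. Everything about the data is an assumption made for the example (sample size, effect size, and the contents of `mydata`); only the `glm()` call itself follows the form given above.

```r
set.seed(1)

# Simulated example: 30 samples without the mutation (0) and 30 with it (1)
n <- 60
mutation <- rep(c(0, 1), each = n / 2)

# Hypothetical transformed expression of one gene, shifted upwards in mutated samples
gene <- rnorm(n, mean = 5 + 1.5 * mutation, sd = 1)
mydata <- data.frame(mutation = mutation, gene = gene)

# Binomial logistic regression: mutation status modelled on expression
fit <- glm(mutation ~ gene, data = mydata, family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value and p-value
coefs <- summary(fit)$coefficients
p_value <- coefs["gene", "Pr(>|z|)"]

# Odds ratio per one-unit increase in expression
odds_ratio <- exp(coef(fit)["gene"])

print(coefs)
print(odds_ratio)
```

In practice this would be run once per candidate gene, with the resulting p-values adjusted for multiple testing, for example via `p.adjust()`.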

You could also perform a linear regression, although the interpretation changes slightly:

lm(gene ~ mutation, data = mydata)
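To show how the interpretation differs, the following sketch fits the linear model on simulated data (the data, sample size and effect size are assumptions for the example). With mutation coded as a two-level factor, the slope is simply the difference in mean expression between mutated and non-mutated samples.

```r
set.seed(2)

# Simulated example: 30 non-mutated (0) and 30 mutated (1) samples
n <- 60
mutation <- rep(c(0, 1), each = n / 2)
gene <- rnorm(n, mean = 5 + 1.5 * mutation, sd = 1)
mydata <- data.frame(mutation = factor(mutation), gene = gene)

# Linear regression: expression modelled on mutation status
fit <- lm(gene ~ mutation, data = mydata)

# Coefficient for the mutated group relative to the non-mutated group
effect <- coef(fit)["mutation1"]

# The same quantity computed directly from the group means
group_means <- tapply(mydata$gene, mydata$mutation, mean)
mean_diff <- group_means["1"] - group_means["0"]

print(summary(fit)$coefficients)
print(c(effect = unname(effect), mean_diff = unname(mean_diff)))
```

So the logistic model asks how well expression predicts mutation status, whereas the linear model estimates how much expression shifts with mutation status.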

Both regression models assume that your input RNA-seq data have been normalised to adjust for biasing factors and then transformed, e.g. via log2(CPM + 1), a variance-stabilising transformation, or a regularised log transformation.
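As a minimal base-R sketch of the log2(CPM + 1) option, assuming a small made-up count matrix with genes in rows and samples in columns. Note that this only scales by raw library size; it does not apply the more careful normalisation offered by dedicated packages such as edgeR or DESeq2.

```r
# Hypothetical raw count matrix: 3 genes x 2 samples
counts <- matrix(
  c(10, 0, 90,
    200, 50, 750),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)

# Library size = total counts per sample
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, scale by 1e6
cpm <- t(t(counts) / lib_size) * 1e6

# log2 transform with a pseudocount of 1 so that zero counts map to 0
log_cpm <- log2(cpm + 1)

print(log_cpm)
```

The rows of `log_cpm` (one per gene) are what would be supplied as the `gene` variable in the regression models.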

Kevin