RNA-seq based gene signature analysis
21 months ago
gspirito ▴ 10

Hello everyone, here's my question:

I have three datasets, each with RNA-seq and WES data for cases and controls of the same disease. After stratifying the cases in each dataset by the presence or absence of a deleterious mutation in a particular gene, I obtain three groups per dataset: Controls, Cases_1 and Cases_2. I then perform differential expression analyses for Cases_1 vs Controls and Cases_2 vs Controls, for each dataset separately. If I select the genes that are differentially expressed in all three "Cases_2 vs Controls" analyses but not in all three "Cases_1 vs Controls" analyses, can I consider those a signature for the Cases_2 group? Or should additional analyses be involved?
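
For illustration, the selection described above can be sketched with set operations in base R. The gene names and lists below are made up, and the sketch shows two possible readings of "not in all three": excluding only genes that are differentially expressed in every Cases_1 analysis, or, more strictly, excluding genes that are differentially expressed in any of them. Which reading is intended is an assumption that should be stated explicitly.

```r
# Hypothetical DE gene lists per dataset (gene names are invented for illustration).
de_cases2 <- list(
  dataset1 = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  dataset2 = c("GENE_A", "GENE_B", "GENE_D", "GENE_E"),
  dataset3 = c("GENE_A", "GENE_B", "GENE_D", "GENE_F")
)
de_cases1 <- list(
  dataset1 = c("GENE_B", "GENE_G"),
  dataset2 = c("GENE_B", "GENE_H"),
  dataset3 = c("GENE_B", "GENE_D")
)

# Genes DE in every "Cases_2 vs Controls" analysis.
shared_cases2 <- Reduce(intersect, de_cases2)
# Genes DE in every "Cases_1 vs Controls" analysis.
shared_cases1 <- Reduce(intersect, de_cases1)
# Genes DE in at least one "Cases_1 vs Controls" analysis.
any_cases1 <- Reduce(union, de_cases1)

# Reading 1: drop only genes that are DE in all three Cases_1 analyses.
candidates_lenient <- setdiff(shared_cases2, shared_cases1)
# Reading 2 (stricter): drop genes that are DE in any Cases_1 analysis.
candidates_strict <- setdiff(shared_cases2, any_cases1)

print(candidates_lenient)
print(candidates_strict)
```

The stricter reading gives a shorter list, but it is less likely to include genes that also respond in Cases_1 and merely missed the threshold in one dataset.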

RNA-Seq • 782 views
21 months ago

can I consider those as a signature for the Cases_2 group?

Yes and no. It is 'tentative' evidence but, for now, it just says that you have identified a group of genes whose expression, when compared to controls, differs between Cases1 and Cases2. You will have to build more evidence to convince people that you have a 'signature', and a signature of what, exactly: changes in expression due to the mutation? Also, what were your thresholds for calling a gene statistically significantly differentially expressed?

It would be good to see Cases1 vs Cases2, just purely out of interest.

You should also construct a binomial logistic regression model in order to build further evidence. This would be of the form:

glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))


Here, mutation would be encoded as 1|0, gene would be a continuous variable of gene expression, and mydata contains Cases1 and Cases2 combined. This would provide more convincing evidence that a gene's expression was altered based on mutation status, but still not direct evidence that the mutation is the cause.
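
As a minimal sketch of how this model might be fitted and read for a single gene, the example below uses simulated data. The sample size, effect size and seed are arbitrary assumptions, and `mydata` here is a stand-in for your own combined Cases1 + Cases2 table.

```r
set.seed(1)

# Simulated stand-in for the combined Cases1 + Cases2 data (values are assumptions).
n <- 40
mydata <- data.frame(mutation = rep(c(0, 1), each = n / 2))
# Transformed expression of one gene, shifted upwards in mutated samples.
mydata$gene <- rnorm(n, mean = 8 + 1.5 * mydata$mutation, sd = 1)

fit <- glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))

# Wald test for the gene coefficient.
coefs <- summary(fit)$coefficients
p_value <- coefs["gene", "Pr(>|z|)"]

# Odds ratio: multiplicative change in the odds of carrying the mutation
# per one-unit increase in the gene's transformed expression.
odds_ratio <- unname(exp(coef(fit)["gene"]))

print(odds_ratio)
print(p_value)
```

An odds ratio above 1 means that higher expression goes with higher odds of being in the mutated group. It says nothing about the direction of causation.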

You could also perform a linear regression, but the interpretation changes slightly:

lm(gene ~ mutation, data = mydata)
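
To make the change in interpretation concrete, here is a small sketch on simulated data (group sizes, effect size and seed are arbitrary assumptions). With mutation coded as 1|0, the fitted slope is the difference in mean expression between mutated and non-mutated samples, and its test matches an equal-variance two-sample t-test.

```r
set.seed(2)

# Simulated stand-in data: 12 non-mutated and 10 mutated samples (assumed sizes).
mydata <- data.frame(mutation = rep(c(0, 1), times = c(12, 10)))
mydata$gene <- rnorm(nrow(mydata), mean = 8 + mydata$mutation, sd = 1)

fit <- lm(gene ~ mutation, data = mydata)

# Slope: estimated change in expression when going from mutation = 0 to mutation = 1.
slope <- unname(coef(fit)["mutation"])
lm_p <- summary(fit)$coefficients["mutation", "Pr(>|t|)"]

# The same quantity computed directly from the group means.
mean_diff <- mean(mydata$gene[mydata$mutation == 1]) -
  mean(mydata$gene[mydata$mutation == 0])

# Equal-variance t-test gives the same p-value as the slope test.
t_p <- t.test(gene ~ mutation, data = mydata, var.equal = TRUE)$p.value

print(c(slope = slope, mean_diff = mean_diff))
```

So the logistic model asks whether expression predicts mutation status, while the linear model asks how much expression shifts with mutation status.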


Both regression models assume that your input RNA-seq data have been normalised to adjust for biasing factors and then transformed, e.g. via log2(CPM + 1), a variance-stabilising transformation, a regularised log transformation, or something comparable.
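
As a rough illustration of the log2(CPM + 1) option, the sketch below uses base R only and a tiny made-up count matrix. It scales by raw library size and nothing else, so it is an assumption-laden simplification rather than a substitute for a proper normalisation method.

```r
# Tiny made-up count matrix: genes in rows, samples in columns.
counts <- matrix(
  c(10, 0, 90,     # sample1 (library size 100)
    20, 5, 175),   # sample2 (library size 200)
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("sample1", "sample2"))
)

# Counts per million: divide each column by its library size, then scale by 1e6.
lib_sizes <- colSums(counts)
cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6

# The +1 keeps zero counts defined on the log scale (log2(0 + 1) = 0).
log_cpm <- log2(cpm + 1)

print(round(log_cpm, 2))
```

GENE_A has twice as many reads in sample2 as in sample1, but the same CPM in both, because sample2's library is twice as large.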

Kevin


Hi, thank you very much for the response.

So for the binomial logistic regression, would it be best to put all raw counts for the three datasets in one matrix, normalize and perform the regressions or do that separately for each dataset?

Can the normalization be performed with DESeq2 functions, such as:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
vsd <- varianceStabilizingTransformation(dds)


Also, would it be best to perform the regressions first and then see how many of the genes with a significant regression are differentially expressed in all three datasets (FDR < 0.05, log2FoldChange > 1 or < -1), or vice versa?

Thank you and sorry for the many questions,

Giovanni


Buonasera Giovanni, yes, I would use the vsd data.

The idea of the regression, in this case, is to confirm that each gene's expression differs based on mutation status:

glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))
lm(gene ~ mutation, data = mydata)


mydata may contain Cases1 and Cases2 samples, combined, or it may contain Cases1 + Cases2 + Controls. In each case, the meaning of the result will change. This is part of research.

Please be flexible with these models, though, and use whatever you feel is appropriate.
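
One possible way to run the logistic model gene by gene and compare the result with a differential expression list is sketched below. Everything in it is an assumption for illustration: the expression matrix is simulated, only `GENE_1` is given a real effect, and `de_in_all_three` is a made-up stand-in for the genes that pass your thresholds in all three datasets.

```r
set.seed(3)

# Simulated transformed expression: 5 genes x 60 samples (30 per mutation group).
n_per_group <- 30
mutation <- rep(c(0, 1), each = n_per_group)
expr <- matrix(
  rnorm(5 * 2 * n_per_group, mean = 8, sd = 1),
  nrow = 5,
  dimnames = list(paste0("GENE_", 1:5), NULL)
)
# Give one gene a genuine shift in the mutated group.
expr["GENE_1", mutation == 1] <- expr["GENE_1", mutation == 1] + 1.5

# Fit one logistic regression per gene and keep the Wald p-value.
p_values <- apply(expr, 1, function(gene) {
  fit <- glm(mutation ~ gene, family = binomial(link = 'logit'))
  summary(fit)$coefficients["gene", "Pr(>|z|)"]
})

# Adjust for the number of genes tested (Benjamini-Hochberg).
fdr <- p.adjust(p_values, method = "BH")
regression_hits <- names(fdr)[fdr < 0.05]

# Hypothetical list of genes DE in all three datasets.
de_in_all_three <- c("GENE_1", "GENE_4")

# Genes supported by both lines of evidence.
overlap <- intersect(regression_hits, de_in_all_three)
print(overlap)
```

This sketch tests every gene and takes the overlap afterwards. If you restrict the regressions to genes that are already differentially expressed, fewer tests are performed, which changes the multiple-testing adjustment.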