can I consider those as a signature for the Cases_2 group?
Yes and no. It is 'tentative' evidence but, for now, it just says that you have identified a group of genes whose expression, when compared to controls, differs between Cases1 and Cases2. You will have to build more evidence to convince people that you have a 'signature', and a signature of what, exactly? Changes in expression due to the mutation? What were your thresholds for calling a gene statistically significantly differentially expressed?
It would be good to see Cases1 vs Cases2, just purely out of interest.
You should also construct a binomial logistic regression model in order to build further evidence. This would be of the form:
glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))
Here, mutation would be encoded as 1|0, gene would be a continuous variable of gene expression, and mydata contains Cases1 and Cases2 combined. This would provide more convincing evidence that a gene's expression was altered based on mutation status, but still not direct evidence that the mutation is the cause.
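As a minimal sketch of the logistic model above, here it is fitted to simulated data. The sample size, the simulated expression values, and the size of the shift between groups are assumptions made purely for illustration:

```r
set.seed(1)

# Simulated example: 20 samples without the mutation (0) and 20 with it (1)
mutation <- rep(c(0, 1), each = 20)

# Hypothetical transformed expression values; mutated samples are shifted up by 1
gene <- rnorm(40, mean = 8 + mutation, sd = 1)

mydata <- data.frame(mutation = mutation, gene = gene)

# Binomial logistic regression: does expression predict mutation status?
fit <- glm(mutation ~ gene, data = mydata, family = binomial(link = 'logit'))

# Coefficient table: estimate (log odds per unit of expression), SE, z, p-value
coefs <- summary(fit)$coefficients
print(coefs)

# Odds ratio per one-unit increase in expression
odds_ratio <- exp(coef(fit)['gene'])
print(odds_ratio)
```

The row named gene in the coefficient table is the one of interest; the intercept is rarely informative here.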
You could also perform a linear regression, but the interpretation changes slightly:
lm(gene ~ mutation, data = mydata)
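To illustrate how the interpretation changes, the sketch below fits the linear model to simulated data (all values are assumptions for illustration). With a 1|0 predictor, the slope is simply the difference in mean expression between the two groups, and its p-value matches that of an equal-variance two-sample t-test:

```r
set.seed(1)

# Simulated example: 20 samples without the mutation (0) and 20 with it (1)
mutation <- rep(c(0, 1), each = 20)
gene <- rnorm(40, mean = 8 + mutation, sd = 1)
mydata <- data.frame(mutation = mutation, gene = gene)

# Linear regression: expression is now the outcome, mutation status the predictor
fit_lm <- lm(gene ~ mutation, data = mydata)

# With a 0/1 predictor, the slope equals the difference in group means
slope <- unname(coef(fit_lm)['mutation'])
mean_diff <- mean(gene[mutation == 1]) - mean(gene[mutation == 0])

# P-value for the mutation term in the linear model
p_lm <- summary(fit_lm)$coefficients['mutation', 'Pr(>|t|)']

# Two-sample t-test assuming equal variances gives the same p-value
p_t <- t.test(gene ~ mutation, data = mydata, var.equal = TRUE)$p.value

print(c(slope = slope, mean_diff = mean_diff, p_lm = p_lm, p_t = p_t))
```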
For either regression model, the assumption is that your input RNA-seq data have been normalised to adjust for biasing factors and then transformed, e.g. via log2(CPM + 1), a variance-stabilising transformation, a regularised log transformation, or something similar.
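For reference, the log2(CPM + 1) transformation can be sketched in base R as follows. The count matrix is a made-up toy example, and this only scales by library size; it is not a substitute for the normalisation that packages such as DESeq2 or edgeR perform:

```r
# Hypothetical raw count matrix: 3 genes (rows) x 4 samples (columns)
counts <- matrix(
  c( 10,  20,  30,  40,
      0,   5,   0,  15,
    990, 975, 970, 945),
  nrow = 3, byrow = TRUE,
  dimnames = list(c('geneA', 'geneB', 'geneC'), paste0('sample', 1:4))
)

# Counts per million: divide each column by its library size, then scale by 1e6
lib_size <- colSums(counts)
cpm <- sweep(counts, 2, lib_size, '/') * 1e6

# log2(CPM + 1); the +1 keeps zero counts defined (they map to 0)
log_cpm <- log2(cpm + 1)
print(round(log_cpm, 2))
```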
Kevin
Hi, thank you very much for the response.
So for the binomial logistic regression, would it be best to put all raw counts for the three datasets in one matrix, normalize, and perform the regressions, or do that separately for each dataset?
Can the normalization be performed with DESeq2 functions? such as:
Also, would it be best to perform the regressions and then see how many of the genes which feature a significant regression are differentially expressed in all three datasets (FDR < 0.05, log2FoldChange > 1 or < -1) or vice versa?
Thank you and sorry for the many questions,
Giovanni
Buonasera Giovanni, yes, I would use the vsd data.
The idea of the regression, in this case, is to confirm that each gene's expression differs based on mutation status.
mydata may contain Group1 and Group2 samples, combined, or it may contain Group1 + Group2 + Controls. In each case, the meaning of the result will change. This is part of research.
Please be flexible with these models, though, and use whatever you feel is appropriate.
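As one possible way to run the regression gene by gene, here is a minimal sketch on a simulated matrix. The matrix expr is a hypothetical stand-in for your transformed data (for example, the matrix extracted from the vsd object), and the gene names, sample size, and simulated shift are assumptions for illustration only:

```r
set.seed(1)

# Hypothetical transformed expression matrix: 5 genes (rows) x 40 samples (columns)
n <- 40
mutation <- rep(c(0, 1), each = n / 2)
expr <- matrix(rnorm(5 * n, mean = 8, sd = 1), nrow = 5,
               dimnames = list(paste0('gene', 1:5), paste0('s', seq_len(n))))

# Shift gene1 in the mutated samples so that it carries a simulated signal
expr['gene1', mutation == 1] <- expr['gene1', mutation == 1] + 1.5

# Fit one logistic model per gene and keep the p-value of the expression term
pvals <- apply(expr, 1, function(gene) {
  fit <- glm(mutation ~ gene, family = binomial(link = 'logit'))
  summary(fit)$coefficients['gene', 'Pr(>|z|)']
})

# Benjamini-Hochberg adjustment across all tested genes
padj <- p.adjust(pvals, method = 'BH')
print(data.frame(p = pvals, padj = padj))
```

The genes passing a chosen adjusted p-value cut-off could then be intersected with your differential expression lists, e.g. via intersect(), in whichever order suits the question being asked.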