Question: Categorize patients based on continuous variables
5 months ago by
The Netherlands/Amsterdam
ciemanek (100) wrote:


My problem concerns categorizing patients to groups based on continuous variables. From previous studies we know that there are continuous differences in the mean expression of two signatures, which are negatively correlated. We are interested in comparing the two extreme groups in terms of differentially expressed genes. Is there any statistical method for determining the cutoff from the data? Maybe some measure of similarity we could use? Would it be reasonable to cluster patients based on those two signatures and choose the extreme groups that way?

Any advice will be appreciated.

Regards, Agata

Tags: expression, biostatistics

Instead of using a single mean value for a signature you could try to cluster the samples using the expression of all genes present in the signature. This could help filter out some of the likely noise coming from genes that are part of the signature but that don't vary much in your data. The approach you described is otherwise reasonable.
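
A minimal sketch of that clustering step, in case it helps. It is plain Python on made-up values: the patient IDs, the four-gene matrix and the `two_clusters` helper are all hypothetical, and average linkage on Euclidean distance is just one reasonable choice among several. On real data you would more likely use something like `scipy.cluster.hierarchy` (`linkage` plus `fcluster`) and look at the dendrogram before deciding where to cut.

```python
from itertools import combinations
from math import dist


def two_clusters(profiles):
    """Average-linkage agglomerative clustering, stopped at two clusters.

    profiles: dict mapping a (hypothetical) patient ID to a list of
    expression values for the signature genes.
    """
    clusters = [[p] for p in profiles]

    def avg_dist(a, b):
        # Mean Euclidean distance over all cross-cluster patient pairs
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 2:
        # Find and merge the two closest clusters
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]


# Made-up data: two genes from signature 1, then two genes from signature 2
profiles = {
    "P1": [9.0, 8.5, 1.0, 1.5],
    "P2": [8.7, 9.1, 1.2, 0.9],
    "P3": [8.9, 8.8, 1.4, 1.1],
    "P4": [1.1, 1.4, 8.8, 9.2],
    "P5": [0.8, 1.0, 9.1, 8.6],
    "P6": [1.3, 0.9, 8.5, 9.0],
}
groups = sorted(two_clusters(profiles))
print(groups)
```

Stopping at two clusters assigns every patient to one of the two groups. If only the extremes are wanted, one option is to cut the tree into more clusters and compare the two that are furthest apart.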

modified 5 months ago • written 5 months ago by Martombo (2.1k)

Yes, this was more or less my reasoning: to cluster patients based on all genes in both signatures and then set a cut-off on the branches. Would it be reasonable then to perform transcriptome-wide differential gene expression testing and co-expression analysis between two groups on such classified data?

written 5 months ago by ciemanek (100)

Yes, I think that's the best solution you can get. Also, you don't necessarily have to bin the samples into two groups: you can perform a differential expression analysis that looks for genes whose expression correlates with a continuous variable. You can model expression as a function of the gene signature score (in DESeq2 or voom, for example).
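
To illustrate the continuous-covariate idea without binning, here is a rough sketch that fits an ordinary least-squares slope per gene against the signature score. It is a deliberately simple stand-in and not what DESeq2 or voom do (those use models designed for RNA-seq counts). The scores, gene names, expression values and the `slope_and_r` helper are all invented for the example.

```python
from math import sqrt
from statistics import mean


def slope_and_r(x, y):
    """Least-squares slope of y on x, and the Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Hypothetical signature score per patient (the continuous covariate)
score = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
# Hypothetical log-expression per gene, in the same patient order
expr = {
    "geneA": [1.0, 2.1, 2.9, 4.2, 5.0, 6.1],  # rises with the score
    "geneB": [5.0, 4.8, 5.1, 4.9, 5.2, 5.0],  # roughly flat
    "geneC": [6.0, 5.1, 4.0, 3.1, 1.9, 1.0],  # falls with the score
}
results = {gene: slope_and_r(score, y) for gene, y in expr.items()}

# Rank genes by the strength of their association with the score
for gene, (slope, r) in sorted(results.items(), key=lambda kv: -abs(kv[1][1])):
    print(f"{gene}: slope={slope:+.2f}, r={r:+.2f}")
```

In a real analysis the score would simply go into the design formula as a numeric covariate, and the package would handle normalisation, dispersion and multiple testing.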

written 5 months ago by Martombo (2.1k)

Thanks a lot! I will definitely look into that. In general, the mean expression of those signatures is correlated with the level of differentiation, and what interests me is the possible underlying differences between highly and poorly differentiated tumors. That's why my first thought was to zoom in on the extreme groups.

Also, since we're discussing this: do you think co-expression analysis to find gene networks would make sense? Should it be performed on the whole dataset (we have no control group), or rather on the extreme groups so that they can be compared? I wonder whether networks reconstructed from the whole dataset would be biased by tissue-specific expression.

written 5 months ago by ciemanek (100)

The clustering idea is great. If, in addition, you are interested in segregating patients based on the expression of just one gene of interest, then you could literally divide the patients into tertiles, quartiles, quintiles, et cetera, and then compare the top and bottom groups.
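
As a rough sketch of the quantile split (standard-library Python, made-up numbers): `extreme_groups` is a hypothetical helper and the patient IDs and scores are invented. The choice of `method="inclusive"` and of treating values equal to a cut point as belonging to the extreme bin are assumptions you may want to change.

```python
from statistics import quantiles


def extreme_groups(scores, n_bins=3):
    """Return (low, high) patient IDs from the bottom and top quantile bins.

    scores: dict mapping a (hypothetical) patient ID to a continuous value,
    e.g. expression of one gene of interest.
    """
    # quantiles() returns the n_bins - 1 cut points between equal-sized bins
    cuts = quantiles(scores.values(), n=n_bins, method="inclusive")
    low_cut, high_cut = cuts[0], cuts[-1]
    low = sorted(p for p, s in scores.items() if s <= low_cut)
    high = sorted(p for p, s in scores.items() if s >= high_cut)
    return low, high


# Made-up expression values for nine patients
scores = {"P1": 2.1, "P2": 7.4, "P3": 5.0, "P4": 1.3, "P5": 8.8,
          "P6": 4.2, "P7": 6.1, "P8": 3.3, "P9": 9.5}
low, high = extreme_groups(scores, n_bins=3)
print("bottom tertile:", low)
print("top tertile:", high)
```

Setting `n_bins=4` or `n_bins=5` gives quartiles or quintiles; the middle bins are simply left out of the comparison.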

A co-expression network would also help. It is possible to identify sub-groups (communities or modules) in such networks and then see how these sub-groups relate to your clinical variables. On the issue of tissue specificity, it's up to you to ensure that your samples are from the same tissue and that there is no bias in that sense. A good study design guards against biases like that.
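
A toy sketch of the network idea, assuming the simplest possible construction: link two genes when the absolute Pearson correlation of their expression across samples reaches a threshold, then take connected components as modules. The gene names, values, the 0.8 threshold and both helper functions are hypothetical. Dedicated packages such as WGCNA (R) use more refined weighting and module detection, so treat this only as an illustration of the concept.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_modules(expr, threshold=0.8):
    """Link genes with |r| >= threshold; return connected components."""
    neighbours = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            neighbours[g1].add(g2)
            neighbours[g2].add(g1)

    modules, seen = [], set()
    for gene in expr:
        if gene in seen:
            continue
        # Depth-first walk collects every gene reachable from this one
        stack, module = [gene], set()
        while stack:
            g = stack.pop()
            if g not in module:
                module.add(g)
                stack.extend(neighbours[g] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


# Made-up expression of four genes across six tumor samples
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],   # tracks geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "geneD": [10.2, 2.1, 7.9, 4.2, 12.1, 5.8],  # tracks geneC
}
modules = coexpression_modules(expr)
print(modules)
```

Each module could then be summarised per patient (for example by its mean expression) and tested against the clinical variables.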

written 5 months ago by Kevin Blighe (25k)

About the tissue specificity: the data I have is tumor data only, and we have no controls for it. What I mean is that some of the genes might be co-expressed simply because of the tissue of origin. I wonder how I could account for that. Should I look for control data in databases (the problem is that sample sizes are usually very small), or can it be addressed during functional analysis?

written 5 months ago by ciemanek (100)
5 months ago by
Glasgow - UK
dariober (9.3k) wrote:

"categorizing patients to groups"

It seems to me that you are looking for statistical methods like logistic regression, decision trees or support vector machines (and many others; a lot of the machine learning literature is about classification).
