Question: How to evaluate whether a gene signature is present in each patient of a dataset
salvatore.raieli2 wrote, 2.2 years ago:

Hi everyone,

I have a microarray dataset (700 patients) in which I identified genes that correlate with an oncogene of interest. I made two gene sets:

1) the genes that correlate most strongly with the oncogene
2) the genes that anti-correlate with the oncogene

I would like to separate the patients into two groups (for instance, the patients in whom the gene signature is present and those who do not express the signature).

What I would like to generate in the end is a dummy variable (1 if the signature is present in the patient, 0 if not). How can I establish that the signature is present in a patient? Is there also a test or metric to evaluate whether this is significant? If you could also suggest how to implement this in R, it would be great.

Thank you in advance for your help,



gene signature gene set R • 874 views

I am not sure I understand your approach. Did you find it in the literature, or did you come up with it yourself? It looks to me like you are mixing up two things: 1) machine learning and 2) limma roast.

With machine learning, you select a set of genes (also called feature selection), and then with a prediction model you can classify each sample into a group. For this kind of analysis you'll need predefined groups, for example patients with good or bad prognosis. I haven't seen any good feature selection methods based on correlation or anti-correlation, though.

With limma roast you can test a gene signature when comparing groups statistically. So not per sample, but per group contrast.

modified 2.2 years ago • written 2.2 years ago by Benn

Apart from calculating the correlation with MYCN, I did other analyses.

I have a microarray dataset of patients, some MYCN-amplified and some not. I performed logistic regression with an L1 penalty (lasso) to do feature selection. So are you suggesting that I use something like KNN to divide the patients into two groups according to the features selected in this way?

I also have the clinical data. The idea is to then fit a Cox proportional-hazards model and plot Kaplan-Meier curves with these groups.

written 2.2 years ago by salvatore.raieli2

I think that sounds more feasible: use the lasso-selected features for e.g. KNN or another ML method. Take a training and test set into account. Survival is another option; it depends on your research question (can you separate patients with and without MYCN amplification by gene expression profile, or can the profiles predict survival?).

written 2.2 years ago by Benn
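
To make the train/test suggestion concrete, here is a minimal sketch of a k-nearest-neighbour classifier evaluated on a held-out test set. It is written in plain Python rather than R, and everything in it is an assumption for illustration: the data are simulated, the four "features" stand in for lasso-selected genes, and the helper names `simulate` and `knn_predict` are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, sample, k=3):
    """Majority vote among the k training samples closest to `sample`."""
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: euclidean(pair[0], sample))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def simulate(label, n, rng):
    """Simulated patients: 4 selected genes, higher in class 1 (assumption)."""
    centre = 8.0 if label == 1 else 5.0
    return [([rng.gauss(centre, 0.5) for _ in range(4)], label) for _ in range(n)]

rng = random.Random(0)
data = simulate(1, 20, rng) + simulate(0, 20, rng)
rng.shuffle(data)

# Hold out the last 10 patients; they are never used for fitting.
train, test = data[:30], data[30:]
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

predictions = [knn_predict(train_x, train_y, x, k=3) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

On real data the feature selection step itself should be run on the training patients only, otherwise the test accuracy will be optimistic.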

Through ML algorithms I am extracting the features that are important for determining whether a patient is MYCN-amplified or not. From these I would like to select some specific signatures (GO pathways, some genes upregulated in cell lines by some drugs) and separate the patients into two groups according to whether a given signature is present or not. Then I would like to see which of these signatures has the greatest impact on prognosis.

written 2.2 years ago by salvatore.raieli2
Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 2.2 years ago:

One simple approach would be to compute the similarity of each patient's profile to the reference signature then cluster the patients into two groups. Assign 1 to the group where similarity is high and 0 to the other. You could compute enrichment by setting a threshold on similarity to the reference and computing the proportion of patients in each cluster that are above the threshold.


Thanks for your answer. What do you mean by similarity? How can I compute the enrichment?

written 2.2 years ago by salvatore.raieli2

My answer wasn't well thought out. What I meant was to represent each patient and the signature as a vector of the expression levels of the genes of interest then compute for example the cosine between the signature vector and each patient vector. You could then rank the patients by their similarity to the signature and partition the list based on a threshold. The enrichment test I mentioned becomes irrelevant since the partitioning is done by thresholding. However, you could still look for enrichment in other patient attributes. Alternatively, you could use the vectors to compute a patient x patient similarity matrix and use it to cluster the patients then test whether the clusters are associated with a high similarity to the signature by the enrichment approach I mentioned.

written 2.2 years ago by Jean-Karim Heriche

Thank you for your answer. I am sorry, but I have not fully understood what you propose. I saw in another thread about TCGA data that they have normal samples and compute a z-score for each gene relative to the normal-sample average. Then they set a threshold on the z-score, and a gene is considered expressed if its z-score passes this threshold. In the same way, they either take the average over the genes in the signature (mean z-score above the threshold) or sum the genes that are over the threshold. I was initially thinking of these approaches, but I do not have normal samples to do this. The question is: I have identified a signature of genes, and I want to know which patients are enriched for that signature. How can I do it?

So, I first thought of filtering my dataset for the genes of the signature and then using k-means (dividing into two groups) or a similar clustering approach, but I am not satisfied with this. How do you rank the patients by their similarity to a signature?

written 2.2 years ago by salvatore.raieli2
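
For reference, the z-score scoring described in this comment can be sketched as follows. This is plain Python rather than R, and it rests on a substitution the thread did not propose: because there are no normal samples, each gene is standardised against the whole cohort instead. The expression values, the helper name `cohort_zscores` and the cutoff of 0 are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical expression matrix: patient -> {gene: expression level}
expression = {
    "P1": {"A": 9.0, "B": 8.5, "C": 9.2},
    "P2": {"A": 5.0, "B": 5.5, "C": 4.8},
    "P3": {"A": 8.8, "B": 9.0, "C": 8.9},
    "P4": {"A": 5.2, "B": 4.9, "C": 5.1},
}
signature = ["A", "B", "C"]

def cohort_zscores(expression, genes):
    """Z-score each gene across all patients (cohort used as reference)."""
    z = {patient: {} for patient in expression}
    for gene in genes:
        values = [expression[p][gene] for p in expression]
        mu, sd = mean(values), stdev(values)
        for p in expression:
            z[p][gene] = (expression[p][gene] - mu) / sd
    return z

z = cohort_zscores(expression, signature)

# Signature score per patient: mean z-score over the signature genes.
scores = {p: mean(z[p][g] for g in signature) for p in expression}

cutoff = 0.0  # assumed cutoff; any real choice needs justification
dummy = {p: int(score > cutoff) for p, score in scores.items()}
print(dummy)
```

With a cohort reference the scores are relative: a patient is "high" only compared with the other patients in the same dataset.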

Let's take an example. If your signature has three genes A, B and C, the signature vector is s = (a,b,c) where a, b and c are the expression levels of A, B and C respectively. Now for a patient X, the vector is x = (x(A), x(B), x(C)) where x(A), x(B), x(C) are the expression levels of genes A, B and C for patient X. Now a simple similarity measure is the cosine (of the angle) between two vectors so sim(patient X, signature) = cos(x,s). Repeat for each patient and you get a list of similarities that you can use to rank the patients. This approach takes into consideration the expression levels. If you don't care about actual expression levels but only if a gene is expressed or not, simply convert each expression level in the vectors to 0 (not expressed) and 1 (expressed) and then proceed as above.

written 2.2 years ago by Jean-Karim Heriche
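
The three-gene example above can be sketched in a few lines. This is plain Python rather than R; the signature levels, the patient values and the 0.9 threshold are invented for illustration and are not from the thread.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical reference expression levels for genes A, B, C
signature = [9.0, 2.0, 7.0]

# Hypothetical patient vectors (x(A), x(B), x(C))
patients = {
    "X1": [8.5, 2.5, 7.2],
    "X2": [3.0, 8.0, 2.5],
    "X3": [9.2, 1.8, 6.5],
}

# sim(patient, signature) = cos(x, s), repeated for each patient
similarity = {p: cosine(x, signature) for p, x in patients.items()}

# Rank patients from most to least similar to the signature
ranked = sorted(similarity, key=similarity.get, reverse=True)

threshold = 0.9  # assumed cutoff on similarity
dummy = {p: int(similarity[p] >= threshold) for p in patients}

for p in ranked:
    print(p, round(similarity[p], 3), dummy[p])
```

Because expression values are typically all positive, raw cosines tend to sit close to 1 for every patient. Centring each gene across the cohort before computing the cosine may spread the values out more.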

Again, thank you for your help. The only point that is not clear to me is which values are associated with the signature vector. For instance, suppose the vector for patient X for genes A, B, C is X = [4.5, 6.5, 7.1]; should the signature vector be S = [1, 1, 1]?

written 2.2 years ago by salvatore.raieli2

I assumed that the signature was composed of expression values for some marker genes. If all you have is a list of marker genes that are expressed/not expressed, then you also need to binarize the patient vectors as explained above.

written 2.2 years ago by Jean-Karim Heriche

I was thinking about this point. How can I binarize it? I assume a gene in patient X is 1 if its value is above a threshold? Or would you use another system?

written 2.2 years ago by salvatore.raieli2

You define a threshold above which you consider a gene is expressed.

written 2.2 years ago by Jean-Karim Heriche
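
A minimal sketch of the binarized variant, using the patient vector X = [4.5, 6.5, 7.1] and S = [1, 1, 1] from the question above. This is plain Python rather than R, and the expression threshold of 6.0 is an arbitrary assumption; the thread does not say how to choose it.

```python
import math

def binarize(values, threshold):
    """1 if the gene is above the expression threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

patient_x = [4.5, 6.5, 7.1]   # genes A, B, C, values from the question
signature = [1, 1, 1]         # all three marker genes expected to be expressed
threshold = 6.0               # assumed cutoff for "expressed"

x_bin = binarize(patient_x, threshold)
sim = cosine(x_bin, signature)
print(x_bin, round(sim, 4))
```

With an all-ones signature the cosine reduces to sqrt(k/n), where k of the n signature genes are called expressed, so the ranking depends only on how many signature genes pass the threshold.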

Are you sure such an approach will be publishable?

written 2.2 years ago by Benn

Why not? What's wrong with it?

written 2.1 years ago by Jean-Karim Heriche

Isn't it arbitrary? Isn't there a risk of many false positives and false negatives? I was also reading some papers, for instance on gene set analysis, where they rank the signature genes within the patient sample to evaluate whether the signature is enriched or not. What do you think about that?

written 2.1 years ago by salvatore.raieli2
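
As a rough illustration of the rank-based idea mentioned here, the sketch below scores one patient by the mean rank of the signature genes within that patient's profile and compares it with random gene sets of the same size to get an empirical p-value. It is plain Python rather than R, and it is a simplified stand-in: it is not the statistic used by ssGSEA, GSVA or any specific published method. The profile is simulated and the helper names `rank_score` and `empirical_p` are hypothetical.

```python
import random
from statistics import mean

def rank_score(ranks, gene_set):
    """Mean rank of the gene set, scaled to (0, 1]; higher = more expressed."""
    return mean(ranks[g] for g in gene_set) / len(ranks)

def empirical_p(profile, gene_set, n_perm=1000, seed=0):
    """Score the gene set and compare it with random sets of equal size."""
    # Rank genes within the patient: lowest expression gets rank 1.
    ordered = sorted(profile, key=profile.get)
    ranks = {gene: i + 1 for i, gene in enumerate(ordered)}
    observed = rank_score(ranks, gene_set)

    rng = random.Random(seed)
    genes = list(profile)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        if rank_score(ranks, random_set) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Simulated profile of one patient: 50 genes, 5 signature genes set high.
rng = random.Random(1)
profile = {f"g{i}": rng.gauss(6.0, 1.0) for i in range(50)}
signature = ["g0", "g1", "g2", "g3", "g4"]
for i, gene in enumerate(signature):
    profile[gene] = 20.0 + i

score, p_value = empirical_p(profile, signature)
dummy = int(p_value < 0.05)  # assumed significance level
print(round(score, 2), p_value, dummy)
```

Ranking within each patient avoids a fixed expression threshold, but with 700 patients the per-patient p-values would still need multiple-testing correction before being turned into a 0/1 variable.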