Question

How to choose a gene set that reproduces known clusters?

7.1 years ago

huwenhuo ▴ 40

I have cluster information for patients based on pathological classification. For example, the pathologist classifies the patients as stage 1, 2, 3, or 4. I now have RNA-Seq data for all of these patients. How can I choose a set of genes from the RNA-Seq data that fits the pathological classification?

The above has been edited by adding the 2nd sentence. Thanks.

RNA-Seq R gene cluster • 2.0k views

ADD COMMENT • link updated 7.0 years ago by Benn 8.3k • written 7.1 years ago by huwenhuo ▴ 40

Your question is not clear. What do you mean by cluster information? Pathological classification implies that your genes should be enriched for specific pathological conditions. You need to define that gene set first and then check whether it shows up as a cluster in your heatmap annotation, if that is how you want to look at it.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

Do you mean supervised clustering, i.e. classification (a supervised machine learning task)? If so, there are many methods for feature selection and classification, such as SVMs or random forests.

ADD REPLY • link 7.0 years ago by Benn 8.3k

The OP needs to be more specific about what they are trying to achieve; otherwise the question is too vague. A clear answer needs a clear question. The OP should point to a paper or a figure that shows what needs to be achieved, and state clearly what type of data they have so far.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

I agree, but it seems rather normal to ask vague questions on this site.

ADD REPLY • link 7.0 years ago by Benn 8.3k

Asking a question is never discouraged. As experienced users, we should step up and ask the OP to clarify. If the OP does not, the moderators can close the question. This acts as a sanity check for the site and keeps it from accumulating open threads for vague or inconclusive queries that were never answered.

The OP is a newbie, so I feel it is my duty to welcome them and explain how to ask clearer questions in order to get the best possible solution. We at Biostars never chide people for a vague or unclear question; rather, we encourage them to restate it clearly. Even when someone did so in the past, I have always seen others step in to help the OP come up with a clarification. This helps a person grow and learn how this forum works. I do not mind answering as long as the OP makes an effort to clarify and understands what needs to be done and where the question falls short, so that an experienced forum member can help out or link to a possible solution.

I just do not want to start a thread about SE here. But on most sites, threads start with a vague question, and experts make an effort to correct it or ask the OP for more clarity. Forums exist to educate and provide information, not to discourage.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

What does SE mean?

I said that I agree; I don't know why you are giving me this story???

ADD REPLY • link 7.0 years ago by Benn 8.3k

StackExchange. There is some discussion about it starting a competing bioinformatics QA/forum site.

ADD REPLY • link 7.0 years ago by GenoMax 141k

Ah thanks for the explanation, I've seen the discussion but didn't follow it...

ADD REPLY • link 7.0 years ago by Benn 8.3k

Ah, sorry for the long explanation and for mentioning StackExchange. I brought it up because I see people criticizing Biostars for having incorrigible posts and whatnot. I am really thankful for this site and indebted to it for what I have learned over the years, so I always encourage people to visit it and learn. That is why I mentioned it; I did not mean to freak anyone out. I want people to understand how useful this forum is: even newbies who are not yet good at asking questions can, with our help, learn and solve their problems.

Now let's not get off track. I would ask the OP to state the query more clearly so that experienced users can help, rather than have the question closed because the OP left it unattended for a long time. Sorry if I dragged the thread out.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

I am the person who asked this question. The question has been edited.

BTW, what does OP stand for here?

ADD REPLY • link 7.0 years ago by huwenhuo ▴ 40

You should start following the guideline of not posting a comment as an answer. It is time to learn how to use the forum. "OP" stands for "original poster". Please clarify your question so that experienced users can give a better answer.

Welcome and learn well. :)

P.S.: A moderator can move the OP's answer to a comment, along with mine.

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Please edit your original question to include additional details as needed.

ADD REPLY • link 7.0 years ago by GenoMax 141k

score 1 · Answer 1 · 2017-05-03

Ah, it is a bit clearer now. If I were in your place, I would generate a count matrix for all patients, with genes as rows and patients as columns, and make a PCA plot to see whether the samples (patients) group by stage or show dissimilarities. This is one approach. You should run the PCA not only on raw counts but also on normalized counts to see how the data behaves. Remember to color the sample points by stage. PCA gives you dimension reduction and also shows how your samples behave.

After that, you can move on to other exploratory analyses, such as differential expression across all patients or stage-specific differential expression analysis (DEA), guided by your PCA output. I would not encourage hierarchical clustering of patients in a heatmap scaled by rows (genes) at this point, since you will have a lot of genes. It is still informative, but PCA is a better approach at this stage. Let me know if this makes sense.
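A minimal sketch of the PCA step described above, assuming a genes x patients count matrix. The data are simulated stand-ins, and the helper names `log_cpm` and `pca_scores` are made up for this illustration. Log-CPM is only one possible normalization; it is swapped in here for simplicity and is not necessarily what the answer had in mind.

```python
import numpy as np


def log_cpm(counts, prior=1.0):
    """Scale a genes x samples count matrix to log2 counts per million."""
    lib_size = counts.sum(axis=0, keepdims=True)
    return np.log2(counts / lib_size * 1e6 + prior)


def pca_scores(expr, n_components=2):
    """PCA of the samples via SVD; expr is genes x samples."""
    x = expr.T                       # samples x genes
    x = x - x.mean(axis=0)           # center every gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


# Simulated stand-in data (not real patients): 500 genes x 40 patients.
rng = np.random.default_rng(0)
stages = np.repeat([1, 2, 3, 4], 10)
base = rng.gamma(shape=2.0, scale=50.0, size=(500, 1))
fold = np.ones((500, 40))
fold[:50, :] = 2.0 ** (0.8 * (stages - 1))   # 50 genes rise with stage
counts = rng.poisson(base * fold)

# Run the PCA on raw counts and on normalized counts, as suggested.
scores_raw, explained_raw = pca_scores(counts.astype(float))
scores_norm, explained_norm = pca_scores(log_cpm(counts))

for stage in np.unique(stages):
    pc1 = scores_norm[stages == stage, 0].mean()
    print(f"stage {stage}: mean PC1 (log-CPM) = {pc1:.2f}")
```

To color the points by stage, something like `plt.scatter(scores_norm[:, 0], scores_norm[:, 1], c=stages)` with matplotlib would do. Note that the sign of each principal component is arbitrary.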

score 1 · Answer 2 · 2017-05-03

7.0 years ago

Benn 8.3k

Depending on how large your data set is, I would suggest supervised clustering, i.e. classifier (or discriminant) analysis. If your data set is small, this wouldn't really make sense, but if it is large (say > 100 patients), it is worth a try.

There are many different classification methods. Try one or two to see whether there is a huge difference.
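As a rough illustration of the classifier route, here is a sketch that picks genes and evaluates a classifier under cross-validation. Everything in it is an assumption made for the example: the data are simulated, the classifier is a plain nearest-centroid rule (chosen only because it fits in a few lines, not because the answer recommends it), and `f_statistic` and `nearest_centroid_cv` are hypothetical helper names. The one point that matters is that gene selection happens inside each training fold, so the held-out patients never influence which genes are chosen.

```python
import numpy as np


def f_statistic(x, y):
    """One-way ANOVA F statistic per gene; x is samples x genes."""
    classes = np.unique(y)
    grand = x.mean(axis=0)
    between = np.zeros(x.shape[1])
    within = np.zeros(x.shape[1])
    for c in classes:
        xc = x[y == c]
        between += xc.shape[0] * (xc.mean(axis=0) - grand) ** 2
        within += ((xc - xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = x.shape[0] - len(classes)
    return (between / df_between) / (within / df_within + 1e-12)


def nearest_centroid_cv(x, y, n_genes=20, n_folds=5, seed=0):
    """Cross-validated accuracy with gene selection done inside each fold."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for c in np.unique(y):                      # stratified fold assignment
        idx = rng.permutation(np.flatnonzero(y == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    correct = 0
    for k in range(n_folds):
        train, test = fold_of != k, fold_of == k
        x_tr, y_tr = x[train], y[train]
        top = np.argsort(f_statistic(x_tr, y_tr))[::-1][:n_genes]
        classes = np.unique(y_tr)
        centroids = np.stack([x_tr[y_tr == c][:, top].mean(axis=0)
                              for c in classes])
        diff = x[test][:, top][:, None, :] - centroids[None, :, :]
        pred = classes[(diff ** 2).sum(axis=2).argmin(axis=1)]
        correct += int((pred == y[test]).sum())
    return correct / len(y)


# Simulated stand-in data: 120 patients x 300 genes, 15 genes track stage.
rng = np.random.default_rng(1)
stage = np.repeat([1, 2, 3, 4], 30)
expr = rng.normal(size=(120, 300))
expr[:, :15] += (stage - 1)[:, None] * 1.0

acc = nearest_centroid_cv(expr, stage)
perm_acc = nearest_centroid_cv(expr, rng.permutation(stage))
print(f"accuracy: {acc:.2f}, with permuted stage labels: {perm_acc:.2f}")
```

The permuted-label run is a sanity check: with four equally sized stages it should sit near chance (about 0.25). If it does not, information is leaking from the test folds.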

ADD COMMENT • link 7.0 years ago by Benn 8.3k

I agree with your approach. Since I felt the question came from a newbie, my approach was the classical one: make no assumptions and derive a hypothesis with an unsupervised method. A supervised method using a classifier is definitely a great approach, depending on the poster's experience with RNA-Seq data. Great answer btw. :)

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

Yeah, there are many ways to analyze a data set. If the OP really wants a gene set that distinguishes between the groups, this approach could help. But it is definitely wise to do PCA and DE analysis as well; e.g., DE genes could even serve as classifier genes.

ADD REPLY • link 7.0 years ago by Benn 8.3k

Exactly. The DE genes can even serve as a classifier once the OP performs hierarchical clustering with them and annotates the samples by stage, or uses k-means to see whether the clusters match the stages. I am always a fan of starting with unsupervised learning, since I do not want to build a hypothesis without exploring the data. But supervised methods are definitely way cooler if the OP has the required skill set.
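A small sketch of the k-means check mentioned above: cluster the patients on a gene subset, then cross-tabulate the clusters against stage. The data are simulated, the k-means routine is a bare-bones version written only for this example, and the adjusted Rand index is added here as one possible agreement score; none of the function names come from the thread.

```python
import numpy as np


def kmeans(x, k, n_iter=100, n_init=10, seed=0):
    """Lloyd's algorithm with k-means++ seeding; keeps the best of n_init runs."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = [x[rng.integers(len(x))]]
        for _ in range(k - 1):
            d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
        centers = np.stack(centers)
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.stack([
                x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        inertia = ((x - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def contingency(a, b):
    """Cross-tabulate two label vectors (rows: a, columns: b)."""
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    table = np.zeros((len(ua), len(ub)), dtype=int)
    np.add.at(table, (ia, ib), 1)
    return table


def adjusted_rand_index(a, b):
    """Agreement between two partitions: 1 = identical, about 0 = chance."""
    table = contingency(a, b)
    pairs = lambda n: n * (n - 1) / 2.0
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    return (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)


# Simulated stand-in data: 40 patients x 30 genes with a distinct profile per stage.
rng = np.random.default_rng(2)
stage = np.repeat([1, 2, 3, 4], 10)
profiles = rng.normal(scale=3.0, size=(4, 30))
expr = profiles[stage - 1] + rng.normal(size=(40, 30))

clusters = kmeans(expr, k=4)
table = contingency(stage, clusters)
ari = adjusted_rand_index(stage, clusters)
print(table)
print(f"adjusted Rand index: {ari:.2f}")
```

One caveat: if the genes were picked as DE between stages, they were chosen using the stage labels, so any agreement between clusters and stages on those same patients will look better than it would on new patients.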

ADD REPLY • link 7.0 years ago by ivivek_ngs ★ 5.2k

DE may not be very reliable here, because the pathologic diagnosis does not truly reflect the molecular expression pattern: a pathologic diagnosis is usually a combination of several molecular types.

I tried consensus clustering analysis. It separates the patients nicely, but the classes do not necessarily match the pathologist's diagnosis; they fit only partially. Molecular classes from consensus clustering and similar methods neglect the pathologic diagnosis.

We could try supervised classification here, with the most variable genes (largest SD) as candidate features and the pathologic diagnosis as the target, and then test which features have more power to classify the patients. The biggest problem is the huge number of candidate genes. I am not convinced by this, but I will give it a try.
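One way to read the plan above, sketched on simulated data: first shrink the candidate list with an SD filter that ignores the diagnosis, then rank the survivors by how strongly they separate the diagnostic groups. The one-way ANOVA F statistic used for the ranking is an assumption of this example, not something stated in the post, and `top_sd_genes` and `anova_f` are hypothetical names.

```python
import numpy as np


def top_sd_genes(expr, n_top):
    """Indices of the n_top genes with the largest SD; expr is genes x samples."""
    sd = expr.std(axis=1, ddof=1)
    return np.argsort(sd)[::-1][:n_top]


def anova_f(expr, labels):
    """One-way ANOVA F statistic per gene across the label groups."""
    classes = np.unique(labels)
    grand = expr.mean(axis=1, keepdims=True)
    between = np.zeros(expr.shape[0])
    within = np.zeros(expr.shape[0])
    for c in classes:
        sub = expr[:, labels == c]
        mean_c = sub.mean(axis=1, keepdims=True)
        between += sub.shape[1] * ((mean_c - grand) ** 2).ravel()
        within += ((sub - mean_c) ** 2).sum(axis=1)
    df_between = len(classes) - 1
    df_within = expr.shape[1] - len(classes)
    return (between / df_between) / (within / df_within)


# Simulated stand-in data: 1000 genes x 60 patients, first 10 genes track stage.
rng = np.random.default_rng(3)
stage = np.repeat([1, 2, 3, 4], 15)
expr = rng.normal(size=(1000, 60))
expr[:10, :] += 1.5 * (stage - 1)

candidates = top_sd_genes(expr, n_top=100)        # filter blind to diagnosis
f_values = anova_f(expr[candidates], stage)       # power to separate stages
ranked = candidates[np.argsort(f_values)[::-1]]   # strongest genes first
print("top-ranked genes:", sorted(ranked[:10].tolist()))
```

The SD filter does not look at the diagnosis, so it can be applied once to all samples. The F ranking does use the diagnosis, so when it feeds a classifier it should be repeated inside each cross-validation training fold.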

But I just realized that what I was originally after is an intermediate approach: choosing a gene set and using it to classify the patients in a way that accounts for both molecular subtypes and the pathologist's diagnosis. The problem is that the pathologic diagnosis is usually a combination of several molecular types. While

A convenient way is to choose representative genes as classifiers. I really had not thought this through before I posted. Thank you very much for all the comments.

ADD REPLY • link 7.0 years ago by huwenhuo ▴ 40

Molecular classes from consensus clustering and similar methods neglect the pathologic diagnosis

Is it possible that you are actually seeing things that are not discernible via pathology alone? Wasn't classification of breast cancers advanced because of molecular classes beyond what pathologists were able to do before?

ADD REPLY • link 7.0 years ago by GenoMax 141k