Question

Cluster samples based on expression of a subset of genes

5.4 years ago

Sumit Paliwal ▴ 40

Hi All, I have RNA-Seq data on some cancer samples. I would like to cluster these samples based on the expression of a defined subset of genes. Subsequently, I want to select the samples with the highest and lowest expression of this gene subset. I am not sure how to do it. Any suggestions are welcome.

RNA-Seq Cluster • 1.5k views

ADD COMMENT • link updated 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 5.4 years ago by Sumit Paliwal ▴ 40

Take a look at bioconductor/rgsepd, where I run DESeq to find genes with differences, then GOSeq to find gene groups of significance, and then it moves on to what you're asking for directly. Subset the counts matrix by gene set, run PCA on the sub-space, and you'll get clustering with respect to named gene sets.

ADD REPLY • link 5.4 years ago by karl.stamm 4.1k

Hi Karl, thanks for the info. The issue in my case is that I want to define sample groups based on the expression of a subset of genes and then do a differential expression (DE) analysis. I do not know whether I can somehow use the raw or normalized counts of this subset of genes to categorize samples and subsequently do a DE analysis.

ADD REPLY • link 5.4 years ago by Sumit Paliwal ▴ 40

Could you be more specific about what data you have? Is it quantified yet? If so, in what unit? Once I know that I can guide you better :-)

ADD REPLY • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

Hi Kristoffer, I have HTSeq counts from some TCGA cancer samples.

ADD REPLY • link 5.4 years ago by Sumit Paliwal ▴ 40

It would be good Biostars manners to update the question instead of adding a comment. That makes it easier for people reading it in the future :-)

ADD REPLY • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 0 · Answer 1 · 2018-11-28

First of all, I recommend that you download the TCGA TPM/RPKM data instead. Those are normalized for a bunch of artefacts; you can read more about why library normalization is needed in this recent post. The important thing is that for your analysis you need to normalize for gene length, sequencing depth, and inter-library differences.
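To make the gene-length and sequencing-depth corrections concrete, here is a minimal Python sketch of the standard TPM calculation for a single sample. The gene names, counts, and lengths are invented toy values, and the function name `counts_to_tpm` is just for illustration. Note that TPM on its own does not correct for inter-library (composition) differences, which is one reason to prefer an already-processed matrix or a dedicated normalization method for that step.

```python
def counts_to_tpm(counts, gene_lengths_bp):
    """Convert raw read counts for ONE sample to TPM.

    counts:          dict of gene -> raw read count (e.g. from HTSeq)
    gene_lengths_bp: dict of gene -> gene length in base pairs (assumed known)
    """
    # Reads per kilobase: corrects for gene length.
    rpk = {g: counts[g] / (gene_lengths_bp[g] / 1000.0) for g in counts}
    # Rescale so the sample sums to one million: corrects for sequencing depth.
    scale = sum(rpk.values()) / 1e6
    return {g: value / scale for g, value in rpk.items()}


# Toy data, invented for illustration only.
counts = {"GENE_A": 100, "GENE_B": 200, "GENE_C": 300}
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 1000}

tpm = counts_to_tpm(counts, lengths)
for gene, value in tpm.items():
    print(gene, round(value, 1))
```

In this toy example GENE_B has twice the counts of GENE_A but is also twice as long, so the two end up with the same TPM.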

Once you have that matrix I would suggest you simply subset it to the genes of interest and do a hierarchical clustering resulting in a dendrogram.
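As a rough sketch of that step, the following standard-library Python subsets a toy expression matrix to the genes of interest, log-transforms it, and runs a simple average-linkage agglomerative clustering. The sample names, gene names, TPM values, the log2(x + 1) transform, and the choice of Euclidean distance with average linkage are all assumptions made for illustration. On real data you would more likely use an existing implementation such as `scipy.cluster.hierarchy.linkage` and `dendrogram`, or `hclust` in R.

```python
import math


def log_subset(expr, genes):
    """Keep only the genes of interest and apply log2(x + 1)."""
    return {s: [math.log2(vals[g] + 1) for g in genes] for s, vals in expr.items()}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns the ordered list of merges."""
    clusters = {name: [name] for name in profiles}  # cluster id -> member samples
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average pairwise distance between members of the two clusters.
                d = sum(euclidean(profiles[x], profiles[y])
                        for x in clusters[a] for y in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(d, 3)))
    return merges


# Toy TPM matrix (sample -> gene -> TPM), invented for illustration only.
expr = {
    "S1": {"GENE_A": 500, "GENE_B": 450, "GENE_X": 10},
    "S2": {"GENE_A": 480, "GENE_B": 470, "GENE_X": 900},
    "S3": {"GENE_A": 5, "GENE_B": 8, "GENE_X": 12},
    "S4": {"GENE_A": 6, "GENE_B": 7, "GENE_X": 850},
}
genes_of_interest = ["GENE_A", "GENE_B"]  # GENE_X is deliberately ignored

profiles = log_subset(expr, genes_of_interest)
merges = average_linkage(profiles)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

The list of merges is the dendrogram in text form: samples joined early are the most similar with respect to the chosen genes, and cutting the tree before the last merge gives two sample groups.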

With regard to extracting the highest and lowest expressed genes, you should just compare the means across all samples from the expression matrix mentioned above. Since it is normalized for all of these factors, this should be straightforward.
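The original question asks for the samples with the highest and lowest expression of the gene subset. One possible reading of the advice above is to score each sample by its mean expression over the subset and then rank the samples. The sketch below does that; the toy values, the log2(x + 1) transform, and the cutoff `n_per_group` are assumptions for illustration, not part of the answer.

```python
import math


def subset_score(sample_expr, genes):
    """Mean log2(TPM + 1) of the genes of interest in one sample."""
    return sum(math.log2(sample_expr[g] + 1) for g in genes) / len(genes)


# Toy TPM matrix (sample -> gene -> TPM), invented for illustration only.
expr = {
    "S1": {"GENE_A": 900, "GENE_B": 800},
    "S2": {"GENE_A": 400, "GENE_B": 300},
    "S3": {"GENE_A": 50, "GENE_B": 40},
    "S4": {"GENE_A": 5, "GENE_B": 8},
}
genes_of_interest = ["GENE_A", "GENE_B"]

scores = {s: subset_score(vals, genes_of_interest) for s, vals in expr.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first

n_per_group = 1  # hypothetical cutoff; quartiles are another common choice
high_group = ranked[:n_per_group]
low_group = ranked[-n_per_group:]
print("high:", high_group, "low:", low_group)
```

The two groups could then be used as the condition labels in a downstream DE analysis run on the raw counts. Keep in mind that the genes used to define the groups will come out as differentially expressed by construction.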