Question: How to identify which genes are responsible for the different clusters without PCA
camillab.40 wrote:

Hi,

I hope this is not a stupid question, but I have done hierarchical clustering (Euclidean distance matrix + complete linkage method) on a subset of genes (8000) in my bulk RNA-seq samples (40), and I found that 13 samples do not cluster as expected/predicted. I also ran a PCA and, in line with the hierarchical clustering, those samples cluster far apart from the others.

Is there any way (or R package) to identify which genes are responsible for the different clusters without using PCA (e.g., identifying the loadings)?

As a practical example, in the dendrogram from this site (dendo): what makes the purple samples (7-13-16) differ from the red ones, and what makes the red + purple cluster sit on a different branch/arm from the blue-green samples?

I guess there are genes that would make all the samples cluster together and genes that are very different, so they would make the samples cluster far apart. This could potentially be observed as macro-differences (red/purple vs green/blue) or micro-differences (red vs purple).

thank you in advance

Camilla

bulkrnaseq clustering hclust R • 234 views
modified 4 months ago by Friederike6.8k • written 4 months ago by camillab.40

I wouldn't use Euclidean distance in such a high dimensional space because it's most likely subject to distance concentration. If you nonetheless manage to get a good clustering it probably means you have a strong signal contributed by a limited number of genes. You should be able to identify them by looking at which ones contribute the most to the distances between cluster centres, i.e. rank the genes by the squared differences of their means in each cluster. Alternatively, you could train a classifier to predict membership to each cluster (using cluster membership as given labels) and examine the weights associated with each gene.
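
A minimal sketch of the ranking described above, on simulated data rather than the poster's. The matrix x stands in for a samples-by-genes matrix such as df4. The planted signal (first 10 genes shifted in the last 8 samples) and all object names are assumptions made for illustration.

```r
# Simulated stand-in for a samples-by-genes matrix (like df4 in this thread).
set.seed(42)
n_samples <- 20
n_genes <- 100
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
            dimnames = list(paste0("s", seq_len(n_samples)),
                            paste0("gene", seq_len(n_genes))))
# Assumption for the demo: genes 1-10 are shifted upwards in samples 13-20.
x[13:20, 1:10] <- x[13:20, 1:10] + 8

# Cluster the samples and cut the tree into two clusters.
hc <- hclust(dist(x), method = "complete")
cl <- cutree(hc, k = 2)

# Per-gene mean within each cluster (the cluster centres).
m1 <- colMeans(x[cl == 1, , drop = FALSE])
m2 <- colMeans(x[cl == 2, , drop = FALSE])

# Each gene's contribution to the squared Euclidean distance between centres.
contrib <- sort((m1 - m2)^2, decreasing = TRUE)
head(contrib, 10)
```

Because the squared differences add up to the squared Euclidean distance between the two centres, contrib / sum(contrib) can be read as each gene's share of that distance. The ranking depends on the scale of each gene, so it is most meaningful on data scaled the same way as for the clustering.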

written 4 months ago by Jean-Karim Heriche24k

ones contribute the most to the distances between cluster centres

That's exactly what I wanted to do, but I am not able to figure out how to get this information from hclust. Do you know of any link I can look at to understand how to do it? Thank you for your answer!

written 4 months ago by camillab.40

To get the clusters, you need to cut the tree generated by hclust, for example with the cutree function.
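
For illustration, a small sketch of cutting a tree with cutree, either by number of clusters (k) or by height (h). The data are simulated (two well-separated groups of six samples), so the object names and values here are assumptions, not the poster's data.

```r
# Two simulated groups of 6 samples each, 50 features per sample.
set.seed(1)
x <- rbind(matrix(rnorm(6 * 50, mean = 0), nrow = 6),
           matrix(rnorm(6 * 50, mean = 5), nrow = 6))
rownames(x) <- paste0("s", 1:12)

hc <- hclust(dist(x), method = "complete")

# Ask for a fixed number of clusters ...
by_k <- cutree(hc, k = 2)
# ... or cut at a height, here halfway between the last two merges.
by_h <- cutree(hc, h = mean(tail(hc$height, 2)))

table(by_k)               # cluster sizes
split(names(by_k), by_k)  # sample names per cluster
```

The result is a named integer vector with one entry per sample, so it only says which sample belongs to which cluster. Gene-level information has to be computed afterwards from the data matrix.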

written 4 months ago by Jean-Karim Heriche24k

Hi (again),

I tried the cutree function (cutting the dendrogram into 5 clusters), but I only got the result below. It gives no information about which genes contribute most, apart from, I guess, that the samples in cluster 1 contribute most:

clu.k5
 1  2  3  4  5 
46  1  1  1  1

Where is my mistake? Here is my dataset:

# A tibble: 6 x 51
  gene  `4_MU` `16_MU` `21_MU`  `0c` `0c_bs1_2` `0c_bs2`
  <chr>  <dbl>   <dbl>   <dbl> <dbl>      <dbl>    <dbl>
1 A4GA~  0.382   0.176   0.316  5.34       4.47     10.0
2 AAAS   3.13    5.22    5.02  28.8       24.2      19.9
3 AACS  21.2    19.7    16.9   13.3       14.0      13.1
4 AAGAB 14.7    22.7    18.8   35.3       37.5      45.4
5 AAK1  17.1    12.5    18.6   16.1       15.1      20.9
6 AAMP  63.8    72.7    65.7   23.4       19.9      16.6
# ... with 44 more variables: `24c` <dbl>,

and here is my script:

library(dplyr)  # for %>%
library(tidyr)  # for drop_na()

# tidy the dataset
df1 <- df %>% drop_na()       # remove rows with NA from the merged file
rnames <- df1$gene            # keep the gene symbols
df1 <- df1[-1]                # drop the gene symbol column
df2 <- as.matrix(df1)
rownames(df2) <- rnames       # assign gene symbols as row names
df3 <- t(df2)                 # transpose: samples in rows, genes in columns
df4 <- scale(df3)             # centre and scale each gene

# hierarchical clustering of the samples
d <- dist(df4)                # Euclidean dissimilarity matrix
hc <- hclust(d, method = "complete")
plot(hc)

# cut into 5 clusters
clu.k5 <- cutree(hc, k = 5)
rect.hclust(hc, k = 5, border = "green")
modified 4 months ago • written 4 months ago by camillab.40

cutree returns a vector of cluster memberships, one entry per sample. You then need to extract the data for each cluster, e.g. for cluster 1: df4[clu.k5 == 1, ]
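
A hedged sketch of how that extraction could feed a gene ranking between any two chosen clusters (the "micro" comparison from the question). Everything here is simulated. The helper rank_genes is hypothetical and written only for this example, and the planted "macro" and "micro" signals are assumptions.

```r
# Simulated samples-by-genes matrix with three planted groups of 10 samples.
set.seed(7)
x <- matrix(rnorm(30 * 60), nrow = 30,
            dimnames = list(paste0("s", 1:30), paste0("gene", 1:60)))
x[11:30, 1:5]  <- x[11:30, 1:5]  + 10  # "macro" split: samples 11-30 vs 1-10
x[21:30, 6:10] <- x[21:30, 6:10] + 10  # "micro" split inside samples 11-30

cl <- cutree(hclust(dist(x), method = "complete"), k = 3)

# Extract the rows of each cluster and take per-gene means:
# the result is a genes-by-clusters matrix of cluster centres.
centres <- sapply(split(seq_len(nrow(x)), cl),
                  function(idx) colMeans(x[idx, , drop = FALSE]))

# Hypothetical helper: rank genes by squared difference between two centres.
rank_genes <- function(centres, a, b) {
  sort((centres[, a] - centres[, b])^2, decreasing = TRUE)
}

micro <- rank_genes(centres, "2", "3")  # genes separating clusters 2 and 3
macro <- rank_genes(centres, "1", "2")  # genes separating clusters 1 and 2
head(micro, 5)
head(macro, 5)
```

Genes that are shifted in both clusters of a pair cancel out in the difference, so the ranking for clusters 2 vs 3 is not dominated by the genes that separate them jointly from cluster 1.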

written 4 months ago by Jean-Karim Heriche24k

Is there a specific reason why you do not want to do PCA? It sounds like a good job for PCA. You can also visualize a loadings plot using the PCAtools package. It is pretty easy to make.
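
For reference, a minimal sketch of reading loadings with base R's prcomp (used here instead of PCAtools to keep the example dependency-free). The data are simulated and the planted signal in genes 1-5 is an assumption.

```r
# Simulated samples-by-genes matrix; genes 1-5 separate samples 11-20.
set.seed(3)
x <- matrix(rnorm(20 * 100), nrow = 20,
            dimnames = list(paste0("s", 1:20), paste0("gene", 1:100)))
x[11:20, 1:5] <- x[11:20, 1:5] + 8

# Samples in rows, genes in columns. scale. = TRUE is a common choice for
# real expression data; it is left FALSE here because the simulated genes
# already share one scale.
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Loadings of each gene on PC1, ranked by absolute value.
loadings_pc1 <- pca$rotation[, "PC1"]
top_pc1 <- head(sort(abs(loadings_pc1), decreasing = TRUE), 10)
top_pc1

# Sample scores on the first two components, e.g. for plotting.
head(pca$x[, 1:2])
```

Loadings describe axes of variation across all samples at once, so a split between two small groups may only show up on a later component, if at all.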

written 4 months ago by ashish570

I did it, but I would like to be able to discriminate between differences across all samples (which I can do with the loadings in the PCA) and differences between specific groups, which I cannot do with PCA. With the PCA in my example above, I can find what I called macro-differences (red/purple vs green/blue) but not micro-differences (red vs purple) without removing samples, which would change the result of the PCA. I don't know if that makes sense.

written 4 months ago by camillab.40
Friederike6.8k wrote:

Is there any way (or R package) to identify which genes are responsible for the different clusters without using PCA

Yes, DESeq2, edgeR and limma would be the most popular tools to achieve this, i.e. to compare replicates of specific groups of cells/samples to each other. All details can be found here, but for a less involved analysis you could also give pcaExplorer a shot, which will take care of many of the details for you.
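
As a rough sketch of what such a group comparison can look like with limma (voom), assuming the Bioconductor package limma is installed and that raw counts and known group labels are available. The counts below are simulated, and the group names "red" and "purple" are borrowed from the question purely for illustration.

```r
library(limma)

# Simulated raw counts: 300 genes, 4 samples per group.
set.seed(11)
n_genes <- 300
group <- factor(rep(c("red", "purple"), each = 4), levels = c("red", "purple"))
counts <- matrix(rnbinom(n_genes * 8, mu = 200, size = 20), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:8)))
# Assumption for the demo: genes 1-10 are 8-fold higher in the purple group.
counts[1:10, group == "purple"] <- counts[1:10, group == "purple"] * 8

# One column per group in the design matrix.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# voom converts counts to log2-CPM with precision weights for lmFit.
v <- voom(counts, design)
fit <- lmFit(v, design)

# Test the purple vs red contrast and list the top-ranked genes.
contr <- makeContrasts(purple_vs_red = purple - red, levels = design)
fit <- eBayes(contrasts.fit(fit, contr))
top <- topTable(fit, coef = "purple_vs_red", number = 10)
top
```

One caveat: if the groups are defined from the same data by clustering, rather than known in advance, the resulting p-values will tend to be over-optimistic, so they are better read as a ranking than as formal evidence.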

written 4 months ago by Friederike6.8k

I need raw read counts to use DESeq2, edgeR and limma, right?

modified 4 months ago • written 4 months ago by camillab.40

yes, that's correct.

written 4 months ago by Friederike6.8k