Top Variable DEGs into a heatmap
2
1
Entering edit mode
7 months ago
ccha97 ▴ 20

I currently have a heatmap (which is k-means clustered) of the top 1000 most variable genes, but I want to be able to narrow it down to about 100-200. The reason why it's set so high is that when I want to analyse each cluster, I end up filtering out any genes that are not considered significantly differentially expressed genes (e.g. padj < 0.05, log2foldchange < 0 or > 0). However when I do this, I get left with a small amount of genes which isn't ideal for when I'm doing GO-term analysis - hence why I start off with a larger number (like 1000), so that even after filtering there's enough genes in the cluster for me to analyse.

I was wondering if there is a way to filter out the genes beforehand, so that all the genes in the heatmap are already significantly expressed and I won't have to filter them out afterwards - that is, all the genes in the heatmap will actually be used for downstream analyses. Hope that makes sense! Also open to any recommendations and suggestions on how to improve my approach.

Here is my code:

#heatmap
topVarGenes <- head(order(rowVars(assay(rld)), decreasing = TRUE), 1000)

#clustered heatmap
set.seed(1234)

k <-   pheatmap(assay(rld)[topVarGenes,], scale="row",kmeans_k = 4)
clusterDF <- as.data.frame(factor(k$kmeans$cluster))
colnames(clusterDF) <- "Cluster"

OrderByCluster <- assay(rld)[topVarGenes,][order(clusterDF$Cluster),] pheatmap(OrderByCluster, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE) acute_hi <- rownames(clusterDF[clusterDF$Cluster == 1,,drop=FALSE])

#DEGs
resSigind = res[ which(res$padj < 0.05 & res$log2FoldChange > 0), ]
resSigrep = res[ which(res$padj < 0.05 & res$log2FoldChange < 0), ]
resSig = rbind(resSigind, resSigrep)

#filtering out any genes from the heatmap which are not significantly different (e.g. do not overlap with DEGs)
acutehi_cluster <- resSig[acute_hi,]

deseq2 heatmap RNA-seq DEGs R • 546 views
2
Entering edit mode
7 months ago
antonioggsousa ★ 2.0k

Hi,

What you can do is to retrieve the list of genes that are differentially expressed, and use this list first to filter the rlog transformed matrix that you have, before trying to obtain the most variable genes. This would return the most variable genes among the genes that are differentially expressed.

António

0
Entering edit mode

Hi António, Thanks for your response. Based on your suggestion I've done this:

rldsig <- rlog(dds)[allSig_genes,]

set.seed(1234)

k <-   pheatmap(assay(rldsig)[topVarGenes,], scale="row",kmeans_k = 4)
clusterDF <- as.data.frame(factor(k$kmeans$cluster))
colnames(clusterDF) <- "Cluster"

OrderByCluster <- assay(rldsig)[topVarGenes,][order(clusterDF$Cluster),] pheatmap(OrderByCluster, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE)  However when I run this, I get the gpar fill error - "Error in check.length("fill") : 'gpar' element 'fill' must not be length 0". I understand it's something to do with matching the annotation columns of the matrix to the row names of the rownames of the dataframe, however am not sure how to go about it in my situation. ADD REPLY 1 Entering edit mode Hi ccha97, What I was suggesting was slightly different from what you've tried (I think so). So, if I understood you retrieve the DEG genes from the rlog transformed data, and then you subset this table to the genes that are most variable. Is this correct? If so, I believe the problem is that when you subset the table for DEG only, many/some genes found as top 1000 most variable will not be among the DEG genes, therefore, when you subset the table again to the most variable genes, it gives you that error. Though I'm not sure if the error is related with this problem. So, try the following to see if the error disappears: topVarGenes <- head(order(rowVars(assay(rld)[allSig_genes,]), decreasing = TRUE), 1000)  This is what I've suggested in first place. Notice that despite you're retrieving the top 1000 most variable genes, the outside head() function will only select the first 10 or so. This essentially will subset your rlog data to the DEG genes only, and only then will search among these for the most variable genes. Not sure if you've at least 1000 DEG genes (you need to have since you're specifying the top 1000). Then just follow the same code as you posted in your question. Let me know if this works. António ADD REPLY 0 Entering edit mode Thank you for your well-written explanation - your line of code worked well and it explains what I wanted to do with my data! ADD REPLY 0 Entering edit mode If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work. ADD REPLY 0 Entering edit mode Sorry for re-bringing up this issue, but I was double checking the genes in the cluster and it appears some of the genes on the heatmap aren't actually significant, which means they're still being expressed on the heatmap (I've checked whether their ENTREZ IDs are included in allSig_genes, and they are not). I've also double checked to make sure allSig_genes is correctly defined, and it looks right so it's likely the problem may be coming from somewhere else? I checked assay(rld)[topVarGenes,] and this appears to still have insignificant genes ADD REPLY 0 Entering edit mode It seems odd to me. What is the result of: all(topVarGenes %in% allSig_genes)  António ADD REPLY 0 Entering edit mode I agree, I don't think it's selecting from the significant gene list, but it is different from the heatmap I originally had so I'm not sure. > all(topVarGenes %in% allSig_genes) [1] FALSE  ADD REPLY 0 Entering edit mode Can you post here the code that you're using, from the first line to the last line? ADD REPLY 0 Entering edit mode dds <- DESeq(deseqdata) res <- results(dds, alpha = 0.05) resSigind = res[ which(res$padj < 0.05 & res$log2FoldChange > 0), ] resSigrep = res[ which(res$padj < 0.05 & res$log2FoldChange < 0), ] resSig = rbind(resSigind, resSigrep) allSig_genes <- rownames(resSig) library("pheatmap") library(cluster) rld <- rlog(dds) topVarGenes <- head(order(rowVars(assay(rld)[allSig_genes,]), decreasing = TRUE), 100) set.seed(1234) k <- pheatmap(assay(rld)[topVarGenes,], scale="row",kmeans_k = 3) clusterDF <- as.data.frame(factor(k$kmeans$cluster)) colnames(clusterDF) <- "Cluster" OrderByCluster <- assay(rld)[topVarGenes,][order(clusterDF$Cluster),]

pheatmap(OrderByCluster,
scale="row",annotation_row = clusterDF,
show_rownames = FALSE,cluster_rows = FALSE)

1
Entering edit mode

Yes, now I understand what is wrong here. What is happening is that when you do assay(rld)[allSig_genes,], you are sub setting the data to the significant genes right. So, the topVarGenes variable contains not gene ids, but vector positions of the most variable genes among the significant genes. However, when you give again the main rlog data, i.e., rld object, to the pheatmap when you filter by topVarGenes is not correct because the rld object contains both, significant and not significant, and topVarGenes is the positions of the most variable genes among the significant only. So, in practice to solve this problem you might filter the rld object to the significant genes, and only then apply obtain the topVarGenes and the pheatmap on the same rld object filtered only to significant genes in both cases. In this case, topVarGenes will be a numeric no. with the positions of the most variable genes to filter, in a table with significant genes only (in this case will match). I created this below as rld_sign <- assay(rld)[allSig_genes,]. I think now, this code will work. Please let me know, if it works.

Run this code:

dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)

resSigind = res[ which(res$padj < 0.05 & res$log2FoldChange > 0), ]
resSigrep = res[ which(res$padj < 0.05 & res$log2FoldChange < 0), ]
resSig = rbind(resSigind, resSigrep)

allSig_genes <- rownames(resSig)

library("pheatmap")
library(cluster)

rld <- rlog(dds)

rld_sign <- assay(rld)[allSig_genes,] # rld object with significant only
topVarGenes <- head(order(rowVars(rld_sign), decreasing = TRUE), 100)

set.seed(1234)
k <-   pheatmap(rld_sign[topVarGenes,], scale="row",kmeans_k = 3)
clusterDF <- as.data.frame(factor(k$kmeans$cluster))
colnames(clusterDF) <- "Cluster"

OrderByCluster <- rld_sign[topVarGenes,][order(clusterDF$Cluster),] pheatmap(OrderByCluster, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE)  ADD REPLY 0 Entering edit mode Thanks António, I think assigning the variable rld_sign makes more sense. The only issue I get when running this code is I get the gpar error again when running the last line (pheatmap).  Error in check.length("fill") : 'gpar' element 'fill' must not be length 0  ADD REPLY 0 Entering edit mode Sorry for the late reply. Did you see this issue: Pheatmap error: 'gpar' element 'fill' must not be length 0 , what to do about that in an assymmetrical matrix . It might help. It is not clear for me what can be causing this problem. António ADD REPLY 1 Entering edit mode 7 months ago You could simply filter the genes by the significance level before running your k-means analysis: #extract significant results get_sigs <- function(df, adj.p, FC){ df <- df[df$adj.P.Val < adj.p & abs(df\$logFC) >= FC,]
return(df)
}


This is how I reduce my gene/protein lists before running downstream analysis.

Hope this helps,

Sebastian