Clustering differentially expressed genes in response to multiple treatments (using edgeR)
Entering edit mode
2.8 years ago
eggrandio ▴ 40


I am analyzing RNA-seq data of arabidopsis samples in response to three treatments (let's call them WT, RT and DT). These treatments have some components in common, so I want to find genes that respond specifically to the unique components in each treatment.

In summary, I have performed the Differential Expression analysis of the three treatments versus the non-treated sample (NT), and I would like to cluster the DE genes in groups of genes that respond similarly to the three treatments and genes that only respond in one of the treatments. I cannot distinguish clearly these groups of genes, and I wanted to make sure that I am not missing something. I can "find" these genes by doing manual filtering (see at the end), but I would like to make a figure to visualize them.

This is the first time I am conducting such analysis, so I will describe what I have done step by step in case I have made any mistake. You can skip directly to step 7 for the clustering analysis. I have included more detailed questions in bold.

The steps I am not sure about are, z-scoring the RPKM data, and clustering methods.

1- Alignment & feature counts (I used

First, I ran FastQC and Trimmomatic to remove bad quality reads. Then I run HISAT2 to align the reads and for the sake of simplicity, I ran featurecounts to count the reads assigned to each gene (I have done also the counts per transcript in a separate analysis).

2- Creating DGE object using edgeR:



counts = read.delim(input, row.names = 1) #read output from FeatureCounts
GeneLength = (read.delim(feature.length, row.names = 1)) #read file with gene length of genes to calculate RPKM

#I have five replicates for each treatment:
           NT1  NT2   NT3  NT4  NT5
AT1G01010   48   51    97   24   55
AT1G01020  204  230   172  180  227
AT1G01030   91   89    59   79  115
AT1G01040 2897 2954  2915 2113 2762
AT1G01050 1011 1030  1140 1018 1137
AT1G01060 8784 9078 10806 8453 9749

#Create the treatment groups (NT - non-treated, W, R and D are the three different treatments):
snames = colnames(counts)
treatment = as.factor(substr(snames, 1, 2)) #remove replicate number
treatment = factor(treatment, levels=c("NT","WT","RT","DT"))

#Create Differential Gene Expression List Object (DGEList) object:
d0 = DGEList(counts = counts, 
             genes = data.frame(Length=GeneLength),
             group = treatment)
d0 = calcNormFactors(d0)

3- Filter DGE object to remove genes with low read count:

RPKM_cutoff = 1

#get index of genes below threshold 
drop = d0 %>% rpkm %>% rowMeans %>% `<`(RPKM_cutoff) %>% which 

#Remove genes with low read counts
d = d0[-drop, ,keep.lib.sizes=FALSE] 

d = calcNormFactors(d)

4- MDS plot of samples. WT and RT group together, while DT and NT seem to be more different. DT is the treatment I am most interested in.

5- Compare all samples against control (non treated).

design = model.matrix(~treatment)

d = estimateDisp(d, design, robust=TRUE)

fit = glmQLFit(d, design, robust=TRUE)

qlf = glmQLFTest(fit, coef=2:4) #compares all the samples to the intercept (control)

AllvsCtrl = topTags(qlf, n = Inf) #unfiltered data

6- Select DE genes with FDR<0.05 and log2(Fold change) greater than 1 in at least one comparison:

FDRts = 0.05 #set FDR threshold

Filtered_byFDR = topTags(qlf, n = Inf, p.value = FDRts)
FCts = log2(2) #set log2(Fold change) threshold

tmp = AllvsCtrl$table %>% select(c(logFC.treatmentWT,logFC.treatmentRT,logFC.treatmentDT)) %>%
  as.matrix %>% abs %>% rowMax %>% `>`(FCts) %>% which #Select only genes with abs(logFC)>FCts

Filtered_AllvsCtrl = Filtered_byFDR[tmp,]

          Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT   logCPM        F       PValue
AT1G21240   2422        2.96674871         3.7371969          8.059693 4.035865 576.7402 3.437391e-19
AT3G15536    495        1.33337783         2.7300966          5.819152 2.281665 473.0633 1.508623e-18
AT1G09932   1443        1.88438553         3.3242943          5.707507 3.360015 385.8037 1.093145e-17
AT4G00700   3611        1.13613808         2.6300577          6.300830 4.405263 396.3470 2.928449e-17
AT2G38860   2127       -0.07110813        -0.1787093          3.025977 5.621593 360.9900 2.998005e-17
AT3G13950   1206        1.78454256         2.4608951          6.459634 2.503970 338.3309 3.899023e-17
AT1G21240 5.570980e-15
AT3G15536 1.222513e-14
AT1G09932 5.905536e-14
AT4G00700 9.717734e-14
AT2G38860 9.717734e-14
AT3G13950 1.053191e-13

[1] 2705

7- Plot data heatamp to cluster genes by expression pattern (here is where I cannot isolate the gene groups I want): Should I use Z-scored RPKM values? Should I use the log2(FC)?

#Convert RPKM to Z-scores:


colnames(RPKMd) = treatment

Should I average RPKMs between replicates before doing Z-score or after?

RPKMd = sapply(split.default(RPKMd, names(RPKMd)), rowMeans) 

zRPKMd = RPKMd %>% %>% zFPKM %>% as.matrix

#Select z-scored RPKM values of genes with DE in at least one treatment:
Hmat = zRPKMd[rownames(zRPKMd) %in% rownames(Filtered_AllvsCtrl$table),]

7a - From what I have been reading, pearson's correlation should be used to cluster genes with a similar expression pattern. When I do so, I see this weird clustering:

        clustering_distance_rows = "pearson",
        show_row_names = F,
        heatmap_legend_param = list(title = "Z-score"),
        column_title = "Treatments",
        row_title = "DE Genes")

This is how the heatmap looks like using the ComplexHeatmap default distance and clustering methods ("euclidean" distance, "complete" clustering):

        show_row_names = F,
        row_split = 15,
        heatmap_legend_param = list(title = "Z-score"),
        column_title = "Treatments",
        row_title = "DE Genes")

Clusters 6 and 8 (from top) are close to what I am looking for: genes with high expression in DT, and low in WT,RT and NT). Cluster 10 seems to have the opposite pattern, alsto interesting for me.

I think I can work with those groups of genes, but the visualization is not as clear as I would like (maybe I should change the color scale?).

I could also find these kind of genes "manually" doing something like this, but I would like to have a figure to show them, as I also want to include further analyses done on these groups in the same plot.

data = Filtered_AllvsCtrl$table

onlyDT = data[which((data$logFC.treatmentWT < 1) &
             (data$logFC.treatmentRT < 1) &
             (data$logFC.treatmentDT > 1)),]

          Length logFC.treatmentWT logFC.treatmentRT logFC.treatmentDT   logCPM        F       PValue
AT2G38860   2127       -0.07110813        -0.1787093          3.025977 5.621593 360.9900 2.998005e-17
AT3G25020   3121        0.37136843         0.4851968          2.618082 4.437666 288.8188 1.796895e-16
AT1G78290   1868       -0.79963552        -0.8454600          2.135336 4.624130 278.2638 2.608339e-16
AT2G37110   1173        0.37447283         0.9296454          1.626979 6.245539 244.4434 8.942511e-16
AT2G14610    902       -0.56183051        -1.0595904          7.413948 4.322275 306.0732 1.075079e-15
AT1G76970   2476        0.07736142         0.8145889          2.161111 4.203167 234.4904 1.332456e-15
AT2G38860 9.717734e-14
AT3G25020 2.426856e-13
AT1G78290 3.019525e-13
AT2G37110 6.587785e-13
AT2G14610 7.575569e-13
AT1G76970 8.305812e-13

Any help with the analysis workflow or visualization would be welcome!


RNA-Seq clustering complexheatmap edgeR heatmap • 1.9k views
Entering edit mode

Cross-posted on Bioconductor:

Entering edit mode

Is that not allowed? I can remove my post here or there. I donĀ“t know which forum would be best for this type of question.

Entering edit mode

There are no written rules but it's best to alert both communities about it so that nobody ends up duplicating efforts.


Login before adding your answer.

Traffic: 1096 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6