Question: R code for Differential Gene Expression
0
gravatar for landrjos
7 months ago by
landrjos10
landrjos10 wrote:

Hi All,

I was wondering if you could look over my R code for differential gene expression using EdjeR. I am looking to determine differential gene expression between wild type (WT) cells and knockout cells (KO). Three biological replicates were grown for each cell line and RNA was harvested. The paired end reads were mapped using STAR. Exon counts were obtained using feature counts. The exon counts were then used for the R code. Would this be sufficient to determine differential gene expression between WT and KO?

library("edgeR")
library("gdata")
library("heatmaply")
library("ggplot2")
library("genefilter")
library("methylumi")

setwd("/Users/jwlandry2/Desktop/RNA-Seq\ ESC\ Data\ Analysis/R_Studio_Data_Sets")

counts=read.delim("BPTFKD_ESC_RNA_Seq_Counts_Final2.txt",  header=T,  row.names="Geneid")

head(counts)

group <- factor(c("WT","WT","WT","KO","KO","KO"))

dge = DGEList(counts=counts,group=group)

keep <- rowSums(cpm(dge)>2) >= 3
dge <- dge[keep, , keep.lib.sizes=FALSE]

#Normalization
y <- calcNormFactors(dge)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
y$common.dispersion

plotMDS(y)
plotBCV(y)

#quasi-likelihood F-test
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)
topTags(qlf)

#output to text file
df <- qlf$table
write.csv(df, 'qlf.csv')

#Get normalizaed CPMs
mtx <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
mtx_to_plot <- varFilter(mtx, var.cutoff= 0.95)

#Correlation matrix
IAC <- mtx_to_plot %>% cor(. , use = "pairwise.complete.obs", method = "pearson")
heatmaply(IAC) %>% layout(margin=list(l=200,b=150))

#Dendrogram
plot(hclust(as.dist(1-IAC), method="ward"))
rna-seq • 1.2k views
ADD COMMENTlink modified 7 months ago • written 7 months ago by landrjos10

Where does your doubt lie about the analysis? 3 biological replicates is usually regarded as the bare minimum for differential expression analysis, so, good that you got that.

You mention that you have exon counts - was your goal differential splicing analysis (see '2.16 Alternative splicing' in the EdgeR User Guide)? Why did you not summarise the exon-level counts to the gene level?

ADD REPLYlink written 7 months ago by Kevin Blighe43k

Hi Kevin,

The goal was not to determine differences in splicing. I used rMATs to do that. I am just looking for differential transcript abundance. I just want to make sure my normalization and F-test sequence is valid. I get a reasonable number of genes, which reasonable pValues, so I don't think there is a problem. I just want some people with more experience with EdgeR to look it over to make sure I am not doing something stupid. It would be nice when I publish this paper, and the corresponding R code, that someone does not find a flaw after the fact.

Can you help me out?

Also, what do you mean by Exon-level counts to the gene level? I summed all exon counts to the single gene level prior to feeding the counts into EdgeR. So I only have total gene exon counts in the EdgeR analysis.

Best,

Joe

ADD REPLYlink modified 7 months ago • written 7 months ago by landrjos10
1

Hey Joe, your code looks fine where EdgeR is concerned. For the downstream parts, I would just have the following comments:

  1. why do you generate a correlation heatmap of all log CPM-normalised counts after filtering for genes of low variance? Usually, people generate a heatmap of the statistically significant genes. One may perform hierarchical clustering, PCA, etc on the log CPM counts for the purposes of QC
  2. when you perform hierarchical clustering with hclust(), you should use ward.D2. 'ward' does not actually implement the true Ward's linkage metric (an obscure bug in R).
ADD REPLYlink written 7 months ago by Kevin Blighe43k

Hi Kevin,

Regarding point 1....can you show me the changes you would suggest? Can you make the edits and paste them back into the field so that I can see how you make the changes....

Best,

Joe

ADD REPLYlink written 7 months ago by landrjos10

Please use the ADD REPLY / ADD COMMENT buttons when adding further details or addressing questions about your answers. The answer box should be reserved to answers to the original question.

ADD REPLYlink written 7 months ago by h.mon25k

Hey Joe, it may first help to understand the purpose of your study(?) On my point #1, one would usually subset your mtx object to include only genes that are statistically significantly differentially expressed, and then generate a heatmap from this subsetted matrix using gplots, pheatmap, ComplexHeatmap, etc. packages. The idea here is to see if the statistically significantly differentially expressed genes can segregate your conditions of interest via clustering. If they can, then these genes are of immediate [clinical] interest.

ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe43k

Hi Kevin,

I have 2 conditions wild type (WT) and knockout (KO). I would like determine if the differential gene expression observed between WT and KO segregate the two groups using clustering or by a denditogram. Basically just as you mentioned in your comment above. I removed the correlation matrix because I would just need a denditogram for the paper. Can you suggest some edits to the relevant code below...

#Get normalizaed CPMs
mtx <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
mtx_to_plot <- varFilter(mtx, var.cutoff= 0.95)

#Dendrogram
plot(hclust(as.dist(1-IAC), method="ward.D2"))

Also can you take a look at my addition of the multiple testing correction? Is this correct to do the FDR from EdgeR and output in the .csv file. I get a output file that looks to be correct but I would not know if there is an error or not.

#quasi-likelihood F-test
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)
topTags(qlf)

#output to text file
df <- topTags(qlf, n=10000, adjust.method="BH", sort.by="PValue")
write.csv(df, 'qlf.csv')

Any help you could provide would be great....

Best,

Joe

ADD REPLYlink modified 7 months ago by Kevin Blighe43k • written 7 months ago by landrjos10

Hi Kevin,

I was wondering if you could provide some feedback on my EDGER code, and its application to my specific experiment as outlined below. Any help would be appreciated.

Best,

Joe

ADD REPLYlink written 7 months ago by landrjos10

Hey Joe, I do not see anything unusual about your code. You are only outputting 10,000 tags, though - are all of those statistically significant?

The next thing is to isolate the genes that are statistically significant from your df object, and then subset your mtx object to include only these genes.

I show different ways of plotting here: A: Hierarchical Clustering in single-channel agilent microarray experiment

ADD REPLYlink written 7 months ago by Kevin Blighe43k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2043 users visited in the last hour