Hello,
I am performing RNA-Seq data analysis to find differentially expressed genes in my data. The data is from inoculated leaves of Barley. The data has 5 mutants and 6 time-points. Each of the mutants and each time-point has 3 technical replicates brining up the total number of samples to 90. There are roughly 40,000 genes in Barley that I wish to probe for differential expression. I have obtained the gene read counts using RSEM. I have some questions about the analysis.
1) My goal is to carry out differential gene analysis, gene expression clustering and construct gene regulatory network from the expression data. For the DGE analysis, I want to find genes which are differentially expressed within each mutant across different time-points. Also I wish to find genes which are differentially expressed among different mutants at a given time-point. I have used the function rsem-generate-data-matrix to combine the counts from the 90 samples. I have read this count data in R. But I cannot understand how I will obtain the desired differential expression p-values and log2FC. Here is the code I have written.
samples <- read.csv("counts.csv", header=TRUE,row.names=1)
# Selecting genes by counts
counts = subset(samples, rowSums(samples > 1) >= 3)
time=c(0, 16, 20, 24, 32, 48)
gen=c("bln1", "dm", "mla6", "rar3", "wt")
cols=data.frame(genotype=rep(gen, each=3*6), timepoint=rep(rep(time, each=3),5), replication=rep(c(1,2,3),30))
cols$timepoint = as.factor(cols$timepoint)
cols$replication = as.factor(cols$replication)
rownames(cols) = colnames(counts)
dds=DESeqDataSetFromMatrix(countData = counts, colData = cols, design = ~ genotype * timepoint * replication)
dds=DESeq(dds)
results(dds)
# Extracting and writing normalized counts
temp = counts(dds, normalized=TRUE)
norm_counts = data.frame(gene = rownames(temp), temp, row.names = NULL)
I think I have to alter the design for each kind of analysis, but I am unsure how to do so. Does the DESeq() function automatically perform normalization? In the code above, I have created the variable "dds" using the "counts" which is not normalized. Do I have to perform normalization and then create the "dds" variable? Here is the link for the counts file.
2) For gene expression clustering and gene regulatory network should we perform rst or vst? Also should the transformation be performed on the raw count or on the normalized counts?
3) I am also not quite sure what conclusion can be drawn from the MA plot.
4) What is the need for shrinkage?
Thank you
Answering some of your questions (you had too many, not a good practice):
Yes, DESeq() automatically perform normalization, you create the DESeqDataSet with counts, as shown on DESeq2 vignette.
For clustering, I use rlog(). rlog() transforms count data, but as it is a DESeq2 function, it also accepts a DESeqDataSet object. I don't know what kind of "gene regulatory network" analysis you have in mind, so I don't know which transformation (if any) is best.
You said you used RSEM to obtain the counts. RSEM primary goal is to obtain transcript level cunts, but DESeq2 is designed to perform differential gene (not transcript) expression. Did you summarize your RSEM results into gene counts?
P.S.: look at the time-course design from this BioConductor RNAseq workflow.
Thank you for your reply.
RSEM generates both counts at the gene level and also at the transcript level. For this experiment I am planning to use only the gene counts and not the transcripts counts.