Question

Differential gene expression from 1 replicate using RPKM and edgeR

0

Entering edit mode

4.6 years ago

Kai_Qi ▴ 130

Hi:

I am starting in learning RNA-seq based differential gene expression analysis in the aim to understand the function of the gene I am studying.

I searched literatures and downloaded the database from the literature and tried to use the package I learned to go through the whole process. However, I found that most the database contains only 1 replicate (control Vs shRNA knockdown) which the currently learned package will not generate any differential expressed genes(https://f1000research.com/articles/5-1408 ). I used the STAR to align to get bam files and Rsubread to count the reads.

In the end I turned to the authors methods to get the information. I saw they used The RPKM measure (reads per kilobases of exon model per million reads) and then use EdgeR to analyze the differential expressed genes.

Does anyone know softwares that can done RPKM analysis from bam or txt files? The paper of RPKM mentioned in Nature Methods contains a lot of statistical analysis and it will be more practical if i can go through some softwares and get the biological information.

Appreciate a lot for your help.

Cai

RNA-Seq R • 2.7k views

ADD COMMENT • link updated 4.6 years ago by i.sudbery 19k • written 4.6 years ago by Kai_Qi ▴ 130

1

Entering edit mode

I have so much to comment but so little time. RPKM is bad practice, one replicate is bad design. I don't have access to f1000 so I can't see the paper.

ADD REPLY • link 4.6 years ago by Asaf 10k

0

Entering edit mode

Thanks for your reply. The title of the f1000 paper is: RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved]

I know it is a bad design to have only 1 replicates. My point is to get some clues for the function of the genes I am studying.

Thanks again,

Cai

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

0

Entering edit mode

Can it be accessed outside of f1000? Can you try and share another link?

ADD REPLY • link 4.6 years ago by Asaf 10k

1

Entering edit mode

I fixed the link. Biostars has the habit of including the ) after URLs in parentheses as part of the URL.

ADD REPLY • link 4.6 years ago by Devon Ryan 104k

0

Entering edit mode

OK. Appreciate a lot

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

0

Entering edit mode

This is about analyzing the dex I have tried. It needs 2 replicates.

I can send it to your email.

What I am asking is an approach for PRKM/FRKM measurement and the subsequent quantification.

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

0

Entering edit mode

The piece of software that generates read counts usually provide FPKM. RSEM for instance return it. Again, FPKM shouldn't usually be used, especially for differential expression.

ADD REPLY • link 4.6 years ago by Asaf 10k

0

Entering edit mode

Thank you very much for your advice! I will figure out a method to dig out the data from published database

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

0

Entering edit mode

Yes, but, do not forget 2 key things here:

FPKM and RPKM are not suitable for differential expression analyses
studies with just 1 replicate are not suitable for differential expression analyses

So, the data is 'not good'

ADD REPLY • link 4.6 years ago by Kevin Blighe 87k

0

Entering edit mode

OK. I managed to get 2 database from 2 different papers which has 1 replicate from each. Now I can practice my analysis with CPM.

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

score 4 · Answer 1 · 2019-09-28

4

Entering edit mode

4.6 years ago

i.sudbery 19k

Hi qicaibiology,

Authors have managed to publish many bad RNAseq experiments in the past. In very early days this was at least part due to a lack of knowledge in the field, but even once things were understood in the field, papers are often reviewed by people who do not have this specialist understanding.

The need to do things more than once (replication) is one of the very first things we learn about science in school. It is no different because we are using big fancy technology. Knowing that I got 367 reads for a gene when a took a sample from condition A and 472 when I took a sample from condition B is of little use if I don't know how much samples from the sample condition vary from sample to sample. Now is true that if I sequenced a same library made from the same sample many times, I'd find that the variane of gene with an average of 367 was 367. But we are interested in much samples vary from each other, not how repeated mesaurements of the same sample vary.

The defects with RPKM are more subtle and arise from the fact that

the sum of the total RPKM across all genes in a sample is not constant from sample to sample
Even if it were, that would make RPKM a compositional measure - that is the RPKM of one gene is affected by the RPKM in other genes
We only approximately know the distribution of RPKM. In the early days, many aruged that log RPKM was approximately normally distributed, but for a long time people argued about how approximate that approximation was. We now know what the precise distribution of counts is, and so that is a much better option for differential expression (although not for other analyses nececssarily).

edgeR will agree to process single replicate data, even though the results will not mean much (and reacent versions will warn you of such), but it will only do so from count data, not from RPKM. So either you are misunderstanding the methods in the paper you are referencing, or (more likely) the authors have not recoded what they did properly. This is one reason why many people are pressing for the release of analysis code instead of a written description of what was done.

There maybe ways to get some meaning from RPKM data, but it would not be using standard software, or at least no in the way it was intended, and the results would still not be fully robust. I know of no software designed to analyse RPKM data for differential expression.

All this is a long winded way of saying that a lot of publicly available RNA-seq data is not good data and you shouldn't waste your time on it. Not all science that got published should have done, and not all science that was published some time ago would get published now. Accept that not all published data is going to be useful.

ADD COMMENT • link 4.6 years ago by i.sudbery 19k

0

Entering edit mode

very much appreciated! I managed to get 2 database from 2 different papers which has 1 replicate from each. Now I can practice my analysis with CPM. DO you think it is doable?

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

0

Entering edit mode

This might be okay for practice bare in mind the following:

The samples need to be pretty similar between the two studies. The number of reps in each study needs to be the similar. Specially having 1 rep each of conditions A & B in study 1 and 1 rep each in study 2 might work. But 2 reps of condition A in study 1 and 2 repos of condition B in study 2 will not work.
You will need to do a paired analysis. The edgeR/limma/DESeq2 manuals all explain how to do this, but basically it comes down to using the design formula ~ study + condition and then performing the test of the condition coefficient. This will attempt to account for the study to study effects.
You need to use raw counts, rather than CPM.
Because the study to study variation might be larger than the condition to condition variation, you might not find as many genes as you would with a single, well designed and well powered experiment.

ADD REPLY • link 4.6 years ago by i.sudbery 19k

0

Entering edit mode

Thanks for your advice. But I saw in the manual that the raw count need to be transformed into CPM? Do you think it is not necessary?

Thanks,

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

1

Entering edit mode

Which manual is this? Normally we would convert counts to CPM in order to do interpretive, or qualitative analysis, but not differential expression. Indeed, looking at your script below, I don't think you are using CPM for differential expression. Much of what you have written below is (very neccessary) quality control. If we extract the code that is actaully doing the differential expression we get the following:

# load data
setwd("/Users/cqi/Desktop/kdm5bkd/")
files <- c("shCTRRep1.txt", "shCTRRep2.txt", "shKdm5bRep1.txt", "shKdm5bRep2.txt")
x <- readDGE(files, columns = c(1,3))
samplenames <- substring(colnames(x), 1, nchar(colnames(x)))

# set up design
group <- as.factor(c("ctrl", "ctrl", "knockdown", "knockdown"))
x$samples$group <- group
design <- model.matrix(~0+group)
contr.matrix <- makeContrasts(
  ctrlvsknockdown = ctrl-knockdown,
  levels = colnames(design))

# Do the DE
x <- calcNormFactors(x, method = "TMM")
v <- voom(x, design, plot = TRUE)
vfit <- lmFit (v, design)
vfit <- contrasts.fit(vfit, contrasts = contr.matrix)
efit <- eBayes(vfit)

# Ouput the results
summary(decideTests(efit))

Once we simplify the code to this core DE code, we can see that you are not really using CPM. Which is for the best. I'm certain that nowhere in the limma or edgeR manuals does it tell you to use CPM for the DE analysis.

Now if we look at the above code for establishing the design you'll see:

# set up design
group <- as.factor(c("ctrl", "ctrl", "knockdown", "knockdown"))
x$samples$group <- group
design <- model.matrix(~0+group)
contr.matrix <- makeContrasts(
  ctrlvsknockdown = ctrl-knockdown,
  levels = colnames(design))

In my above advice, you'll note that I said that you definitely needed to use a paired design for this experiment. However this is not the design you have used in this analysis, which is a straight forward two group, unpair analysis.

In effect, what you are doing here is calculating the mean and variance of the two control samples, calculating the mean and variance of the knockdown samples and asking if the two are different. This is not the correct thing to do here, because we don't know if the two controls from different studys are comparable to each other, nor if the two knockdowns are comparable. Consider the following situation for a gene in such an expriment:

Gene A
-----------

           Control    siKdm5b
Dataset 1  5          10
Dataset 2  20         40

In each case the knockdown cells have twice the expression this gene as the control cells. However, if we calculate the mean and (approximate) variance, we'd say the control mean was 12.5+/-15 and the expression of gene A in siKdm5b was 30+/-10. At an intutive level you can see that these intervals overlap, and so are unlikely to be significant.

However, when we do a paired design we calcualte the difference between the control and knockdown in Dataset 1, and then the difference between the control and knockdown in dataset 2, and then take the average, and ask if its different from 0.

Gene A
-----------

           Control    siKdm5b    FoldChange
Dataset 1  5          10         2
Dataset 2  20         40         2

We can see that in both cases the FoldChange is 2 and the variance on that is 0. Thus this is clearly different from 0.

You can read about implementing paired designs on page 43 of the limma users guide, page 38 of the edgeR users guide, or this part of the DESeq2 users guide.

Basically it comes down to telling limma to account for which study a sample comes from. First you have to tell limma which study each sample comes from. You might do that like (if that really is right for which study each comes frmo):

study = c("study1", "study2", "study1", study2")

Now you need to tell limma to use this information. This means changing your design formula from ~0+group to ~0 + study + group

None of this will guarantee that you will find differential genes. Combining different studies like this is a long shot (as I note above), and it might be that the two studies are just too different to get any good results from combining them.

ADD REPLY • link 4.6 years ago by i.sudbery 19k

1

Entering edit mode

Thank you for your detailed and patient explanation, I modified the code according to your advice and get it work. At the same time get a deeper understanding of the comparison.

The manual I used is here (Manual) from page 7 to 8 it said when using the filter function from edgeR, it uses the CPM.

Thanks again for your help. It really helps a lot.

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130

1

Entering edit mode

Yes, we use CPM to decide which genes to keep in the analysis, but then do the actual analysis with raw counts.

ADD REPLY • link 4.6 years ago by i.sudbery 19k

0

Entering edit mode

I ran through the steps in the manual, but get no significant changed genes. I can not figure out which part is wrong. my code are listed as:

setwd("/Users/cqi/Desktop/kdm5bkd/")
files <- c("shCTRRep1.txt", "shCTRRep2.txt", "shKdm5bRep1.txt", "shKdm5bRep2.txt")
read.delim(files[1], nrows = 5)
library(limma)
library(edgeR)
x <- readDGE(files, columns = c(1,3))
class(x)
dim(x)
samplenames <- substring(colnames(x), 1, nchar(colnames(x)))
samplenames
colnames(x) <- samplenames
group <- as.factor(c("ctrl", "ctrl", "knockdown", "knockdown"))
x$samples$group <- group
x$samples
library(Mus.musculus)
geneid <- row.names(x)
genes <- select(Mus.musculus, keys = geneid, columns = c("SYMBOL", "TXCHROM"), keytype = "ENTREZID")
dim(genes)
head(genes)
genes <- genes[!duplicated(genes$ENTREZID), ]
x$genes <- genes
x
cpm <- cpm(x)
lcpm <- cpm(x, log=TRUE)
L <- mean(x$samples$lib.size)*1e-6
M <- median(x$samples$lib.size)*1e-6
c(L, M)
table(rowSums(x$counts==0)==4)
keep.exprs <- filterByExpr(x, group = group)
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
lcpm.cutoff <- log2(10/M + 2/L)
library(RColorBrewer)
nsamples <- ncol(x)
col <- brewer.pal(nsamples, "Paired")
par(mfrow=c(1,2))
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0, 0.26), las=2, main="", xlab="")
title(main = "A. Raw data", xlab = "Log-cpm")
abline(v=lcpm.cutoff, lty = 3)
for (i in 2:nsamples) {
  den <- density(lcpm[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col = col, bty = "n")
lcpm <- cpm(x, log=TRUE)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0, 0.26), las=2, main="", xlab="")
title(main = "B. Filetered data", xlab = "Log-cpm")
abline(v=lcpm.cutoff, lty = 3)
for (i in 2:nsamples) {
  den <- density(lcpm[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col = col, bty = "n")
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
plotMDS(lcpm, labels = group, col=col.group)
design <- model.matrix(~0+group)
colnames(design) <- gsub("group", "", colnames(design))
design
contr.matrix <- makeContrasts(
  ctrlvsknockdown = ctrl-knockdown,
  levels = colnames(design))
contr.matrix
v <- voom(x, design, plot = TRUE)
v
vfit <- lmFit (v, design)
vfit <- contrasts.fit(vfit, contrasts = contr.matrix)
efit <- eBayes(vfit)
plotSA(efit)
summary(decideTests(efit))

But the final output is this:

summary(decideTests(efit)) ctrlvsknockdown Down 0 NotSig 16689 Up 0

I don't know why. At least the shRNA target gene itself should pop out. Can anyone help me figure it out?

ADD REPLY • link 4.6 years ago by Kai_Qi ▴ 130