Question: FPKM or rlog from DEseq2 for machine learning analysis
1
gravatar for Silvia
22 months ago by
Silvia10
Silvia10 wrote:

I have RNAseq data that I have normalized using RSEM with FPKM and when I make a boxplot per sample to see the normalization plotting the log2(FPKM+0.001), the boxplot is not very good: https://ibb.co/gYgNLm On the other hand, when I plot the normalization using DEseq2 package with rlog, the boxplot looks pretty good: https://ibb.co/jwuRfm

So my questions are: 1) Is it normal the boxplot from FPKM or something is wrong with the data? 2) What is more appropriate to use for machine learning techniques such as random forest, the log2(FPKM+0.001) or the rlog normalization from DEseq2?

Thank you!!

rna-seq deseq2 fpkm normalization • 1.9k views
ADD COMMENTlink modified 6 months ago by Charles Warden7.2k • written 22 months ago by Silvia10

thanks Kevin,

I obtained the FPKM values with RSEM using the same count matrix that I use to calculate the rlog. Do you have any idea of why these values seem wrong?

Also, I have plotted the same values from here https://xenabrowser.net/datapages/?cohort=GTEX using the GTEX data and same value (log2(FPKM +0.001) and the plot looks pretty similar to my one. I would like to make both data sets comparable so that is why I follow similar pipeline that the project there, any idea of what is going on?

Thank you very much!!

ADD REPLYlink written 22 months ago by Silvia10
1

Hey again Silvia, you probably should have added this as a comment to my answer above, but not to worry! :)

The method of normalisation that produces FPKM expression values has come under criticism in recent years, in part due to the issues that you have found when plotting these counts. FPKM does not manage very well the fact that, in RNA-seq, there can be a very wide range of count distribution in the samples being studied. As Igor mentioned, there is neither any cross-sample normalisation performed while producing FPKM expression values.

In your figure, samples C3 and H12 appear problematic.

ADD REPLYlink modified 10 months ago • written 22 months ago by Kevin Blighe47k

Upps, I thought I replied directly to you! :)

Ok, I see the idea. Thank you very much for your help!!

Silvia

ADD REPLYlink written 22 months ago by Silvia10
1

Absolutely no problem. You can reply to this thread again to get my attention if you need more help!

ADD REPLYlink written 22 months ago by Kevin Blighe47k
1
gravatar for Kevin Blighe
22 months ago by
Kevin Blighe47k
Kevin Blighe47k wrote:

Hi Silvia,

I would definitely recommend the regularised log counts for use with your downstream machine learning techniques, such as Random Forest. The distribution of your logged FPKM counts does not appear to be correct, and this may therefore bias the Random Forest results.

One extra thing about which you need to be careful: some machine learning techniques may automatically log your data within the function, so, be sure that it does not do any extra transformations about which you're unaware.

Kevin

ADD COMMENTlink modified 6 months ago • written 22 months ago by Kevin Blighe47k
1
gravatar for igor
22 months ago by
igor8.1k
United States
igor8.1k wrote:

The FPKM values are only normalized per sample. The DESeq2 rlog values are normalized across all samples. Thus, the rlog values are more appropriate to compare the different samples.

You could argue that the DESeq2 normalized counts are even better because they have not been adjusted to reduce variance.

ADD COMMENTlink written 22 months ago by igor8.1k
1
gravatar for Charles Warden
6 months ago by
Charles Warden7.2k
Duarte, CA
Charles Warden7.2k wrote:

While the exact normalization can be up for debate (FPKM, CPM, etc.), having independence between your training and validation datasets is important.

If the DESeq2 rlog values are affected by the overall set of samples, you could be violating your independence for validation if your normalization can be carried out independently for each sample.

For example, consider the DESeq2 vignette:

rld <- rlog(dds, blind=FALSE)

The command above is used in the rlog section.

To make your life easier, I'll copy over the upstream parts of the example code:

library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
rownames(coldata) <- sub("fb", "", rownames(coldata))
cts <- cts[, rownames(coldata)]
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind=FALSE)

(I apologize that this is cut-and-paste from the website: however, a big thanks to Mike Love for making this a lot easier for me to show!)

Now, let's try using duplicates (instead of total n=7):

dds2 <- DESeqDataSetFromMatrix(countData = cts[,c(1,2,4,5)],
                                colData = coldata[c(1,2,4,5),],
                                design = ~ condition)
dds2 <- DESeq(dds2) #I think you can skip this...
rld2 <- rlog(dds2, blind=FALSE)

As you can see from the head(assay(rld)) and head(assay(rld2)) commands, the rlog values are different with different sets of samples (which I think is undesirable for what you are describing)

You can also see that the values still vary after setting blind=TRUE (even though I would guess it wouldn't add a huge bias, and this is a small demo dataset with clear differences to begin with):

rld.blind <- rlog(dds, blind=TRUE)
rld2.blind <- rlog(dds2, blind=TRUE)

That said, if you are concerned about the variance stabilization (because the lower counts tend to have more variability), I think there are ways where a log-transformed FPKM values is arguably kind of like the rlog transformation (and there are probably other strategies that avoid normalizing expression in ways that vary depending upon the total samples considered).

For example, lets say you are using log2(FPKM + 0.1) values. If all FPKM values are below 0.1, you might want to exclude those, but I'm guessing they will also be removed during feature selection (filtering more genes overall). However, the point is that the expression value is calculated independently for each sample.

If you choose a low rounding factor, you may see a increase in variability for the lower expressed genes. However, one solution you may want to consider is increasing the rounding threshold (such as using FPKM of 1 instead of 0.1). Or, if you end up selecting a lower throughput assay with a smaller number of features (first identified from your high-throughput screen), then your scoring method that costs less to test in more samples would be different :)

I also understand that the vignette says "The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications," but I think they are mostly talking about a comparison of rlog to the normalized counts within DESeq2 (which I am OK with, but I would prefer to have a normalization that can be independently calculated for each sample).

For example, in the DESeq2 vignette, this is the example for the log-transformed counts:

ntd <- normTransform(dds)
library("vsn")
meanSdPlot(assay(ntd))

However, if you increase the rounding threshold (using a pseudocount of 100 instead of a pseudocount of 1), you get rid of the bump in increased variance:

ntd100 <- normTransform(dds, pc=100)
meanSdPlot(assay(ntd100))

In fact, that plot looks really similar to meanSdPlot(assay(rld)), at least on my computer (they are similar to each other, but different than the plots on-line with the smaller pseudocount).

So, even when using the context of that code, using a higher rounding threshold (or pseudocount) can help stabilize variance (and help with filtering genes with lower counts / expression).

ADD COMMENTlink modified 6 months ago • written 6 months ago by Charles Warden7.2k

If it helps, here are the plots for what I was describing:

library("vsn")

log-counts (pseudocount of 1):

ntd <- normTransform(dds)
png("log-counts_PC1.png")
meanSdPlot(assay(ntd))
dev.off()

log counts, pseudocount of 1

rlog:

rld <- rlog(dds, blind=FALSE)
png("rlog.png")
meanSdPlot(assay(rld))
dev.off()

rlog

log-counts (pseudocount of 100):

ntd100 <- normTransform(dds, pc=100)
png("log-counts_PC100.png")
meanSdPlot(assay(ntd100))
dev.off()

log counts, pseudocount of 100

In other words, rlog normalization and the higher pseudocount of 100 both get rid of the increased variance around 5000 on the x-axis.

P.S. Thank you for these instructions on how to load an image!

ADD REPLYlink modified 6 months ago • written 6 months ago by Charles Warden7.2k
1

I apologize for all the messages. However, I noticed something else, and I also thought it might be a good idea to separate out some of these points.

So, if you are curious, you can see more in this blog post.

Otherwise, I will focus on updating that blog post (if needed), instead of adding more comments here. I am certainly glad to respond to questions, but I didn't want to bother those who might not want to keep getting updates about this part of the thread :)

ADD REPLYlink modified 6 months ago • written 6 months ago by Charles Warden7.2k

In the process of writing a long response, I think I over-looked that the original question was for RSEM FPKM (instead of unique read FPKM, calculated from the raw read counts). That probably matters to some extent (and I apologize for the confusion), but the plots above at least show the concept can be applied to CPM values (which are probably from unique counts).

I didn't use this same strategy, but there was a paper related to the "rounding factors" for log2(FPKM + X) transformation (where I am saying X could be 0.1 or 1), and an associated blog post where comments can be added:

http://bioinfo.aizeonpublishers.net/content/2013/6/bioinfo285-292.pdf

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

Also, you can have over-corrected expression if you focus too much on the box-plots / density distribution. I think some sort of higher rounding factor probably should have been used (to avoid values like -10), but (with the possible exceptions of H12 and C3) it wasn't immediately obvious to me that the first box-plot differences represented an unacceptable amount of variability (in the user's question, rather than one of the DESeq2 demo datasets used for the variance stabilization plots).

ADD REPLYlink modified 6 months ago • written 6 months ago by Charles Warden7.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 660 users visited in the last hour