FPKM or rlog from DESeq2 for machine learning analysis
6.5 years ago
Silvia ▴ 30

I have RNA-seq data that I normalized to FPKM with RSEM. When I make a per-sample boxplot of log2(FPKM + 0.001) to check the normalization, the boxplot does not look very good: https://ibb.co/gYgNLm On the other hand, when I plot the rlog-transformed values from the DESeq2 package, the boxplot looks pretty good: https://ibb.co/jwuRfm
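For reference, a minimal sketch of how the two boxplots can be produced (fpkm_mat and dds are placeholder names for the RSEM FPKM matrix and a DESeqDataSet built from the raw counts):

# fpkm_mat: gene x sample matrix of RSEM FPKM values (placeholder name)
boxplot(log2(fpkm_mat + 0.001), las = 2, main = "log2(FPKM + 0.001)")

# dds: DESeqDataSet built from the raw counts (placeholder name)
library(DESeq2)
rld <- rlog(dds, blind = TRUE)
boxplot(assay(rld), las = 2, main = "rlog (DESeq2)")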

So my questions are: 1) Is the FPKM boxplot normal, or is something wrong with the data? 2) Which is more appropriate for machine learning techniques such as random forest: log2(FPKM + 0.001) or the rlog transformation from DESeq2?

Thank you!!

Tags: RNA-Seq • normalization • FPKM • DESeq2

Thanks Kevin,

I obtained the FPKM values with RSEM from the same count matrix that I used to calculate the rlog values. Do you have any idea why these values seem wrong?

Also, I have plotted the same transformation (log2(FPKM + 0.001)) for the GTEx data from https://xenabrowser.net/datapages/?cohort=GTEX, and the plot looks pretty similar to mine. I would like to make both data sets comparable, which is why I am following a pipeline similar to that project's. Any idea what is going on?

Thank you very much!!


Hey again Silvia, you probably should have added this as a comment to my answer above, but not to worry! :)

The method of normalisation that produces FPKM expression values has come under criticism in recent years, in part because of the issues that you have found when plotting these counts. FPKM does not cope well with the very wide range of count distributions that can exist across the samples being studied in RNA-seq. As Igor mentioned, no cross-sample normalisation is performed when producing FPKM expression values.
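To illustrate that last point, here is a rough sketch of the FPKM calculation for a single sample (RSEM itself works with expected counts and effective lengths, so this is only illustrative): the formula uses only that sample's counts and library size, so no information from the other samples is involved.

# Illustrative only: FPKM for one sample depends only on that sample.
# counts: read counts for one sample; gene_len_kb: gene lengths in kilobases
fpkm_one_sample <- function(counts, gene_len_kb) {
  (counts / gene_len_kb) / (sum(counts) / 1e6)
}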

In your figure, samples C3 and H12 appear problematic.


Oops, I thought I had replied directly to you! :)

Ok, I see the idea. Thank you very much for your help!!

Silvia


Absolutely no problem. You can reply to this thread again to get my attention if you need more help!

5.2 years ago

While the exact normalization can be up for debate (FPKM, CPM, etc.), having independence between your training and validation datasets is important.

If the DESeq2 rlog values are affected by the overall set of samples, you could be violating that independence during validation, whereas a normalization that can be carried out independently for each sample avoids the problem.

For example, consider the DESeq2 vignette:

rld <- rlog(dds, blind=FALSE)

The command above comes from the rlog section of the vignette.

To make your life easier, I'll copy over the upstream parts of the example code:

library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
rownames(coldata) <- sub("fb", "", rownames(coldata))
cts <- cts[, rownames(coldata)]
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind=FALSE)

(I apologize that this is cut-and-paste from the website: however, a big thanks to Mike Love for making this a lot easier for me to show!)

Now, let's try using only two replicates per condition (n = 4 instead of the full n = 7):

dds2 <- DESeqDataSetFromMatrix(countData = cts[,c(1,2,4,5)],
                                colData = coldata[c(1,2,4,5),],
                                design = ~ condition)
dds2 <- DESeq(dds2) #I think you can skip this...
rld2 <- rlog(dds2, blind=FALSE)

As you can see from the head(assay(rld)) and head(assay(rld2)) commands, the rlog values differ when different sets of samples are used (which I think is undesirable for what you are describing).
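For example, one way to line them up is to restrict rld to the same four samples:

head(assay(rld)[, c(1, 2, 4, 5)])   # rlog computed with all 7 samples
head(assay(rld2))                   # rlog computed with only these 4 samples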

You can also see that the values still vary after setting blind=TRUE (even though I would guess it wouldn't add a huge bias, and this is a small demo dataset with clear differences to begin with):

rld.blind <- rlog(dds, blind=TRUE)
rld2.blind <- rlog(dds2, blind=TRUE)
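The same side-by-side comparison can then be repeated for the blind transformations:

head(assay(rld.blind)[, c(1, 2, 4, 5)])
head(assay(rld2.blind))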

That said, if you are concerned about variance stabilization (because lower counts tend to have more variability), I think there are ways in which log-transformed FPKM values are arguably similar to the rlog transformation (and there are probably other strategies that avoid normalizing expression in ways that vary depending upon the total set of samples considered).

For example, let's say you are using log2(FPKM + 0.1) values. Genes whose FPKM values are below 0.1 in every sample might be worth excluding, but I'm guessing they will also be removed during feature selection (which filters more genes overall). However, the point is that the expression value is calculated independently for each sample.
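A minimal sketch of that per-sample transformation and filter (fpkm_mat is a placeholder for the gene x sample FPKM matrix):

log2_fpkm <- log2(fpkm_mat + 0.1)        # computed value-by-value, sample-by-sample
keep <- rowSums(fpkm_mat >= 0.1) > 0     # drop genes below the rounding factor in all samples
log2_fpkm <- log2_fpkm[keep, ]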

If you choose a low rounding factor, you may see an increase in variability for the lower-expressed genes. However, one solution you may want to consider is increasing the rounding threshold (such as using an FPKM of 1 instead of 0.1). Or, if you end up selecting a lower-throughput assay with a smaller number of features (first identified from your high-throughput screen), then the scoring method for that cheaper assay, tested in more samples, would be different anyway :)

I also understand that the vignette says "The variance stabilizing and rlog transformations are provided for applications other than differential testing, for example clustering of samples or other machine learning applications," but I think they are mostly talking about a comparison of rlog to the normalized counts within DESeq2 (which I am OK with, but I would prefer to have a normalization that can be independently calculated for each sample).

For example, in the DESeq2 vignette, this is the example for the log-transformed counts:

ntd <- normTransform(dds)
library("vsn")
meanSdPlot(assay(ntd))

However, if you increase the rounding threshold (using a pseudocount of 100 instead of a pseudocount of 1), you get rid of the bump in increased variance:

ntd100 <- normTransform(dds, pc=100)
meanSdPlot(assay(ntd100))

In fact, that plot looks really similar to meanSdPlot(assay(rld)), at least on my computer (the two are similar to each other, but different from the plots online that use the smaller pseudocount).

So, even in the context of that code, using a higher rounding threshold (or pseudocount) can help stabilize variance (and help with filtering genes with lower counts / expression).


If it helps, here are the plots for what I was describing:

library("vsn")

log-counts (pseudocount of 1):

ntd <- normTransform(dds)
png("log-counts_PC1.png")
meanSdPlot(assay(ntd))
dev.off()

[mean-SD plot: log counts, pseudocount of 1]

rlog:

rld <- rlog(dds, blind=FALSE)
png("rlog.png")
meanSdPlot(assay(rld))
dev.off()

[mean-SD plot: rlog]

log-counts (pseudocount of 100):

ntd100 <- normTransform(dds, pc=100)
png("log-counts_PC100.png")
meanSdPlot(assay(ntd100))
dev.off()

[mean-SD plot: log counts, pseudocount of 100]

In other words, rlog normalization and the higher pseudocount of 100 both get rid of the increased variance around 5000 on the x-axis.

P.S. Thank you for these instructions on how to load an image!


I apologize for all the messages. However, I noticed something else, and I also thought it might be a good idea to separate out some of these points.

So, if you are curious, you can see more in this blog post.

Otherwise, I will focus on updating that blog post (if needed), instead of adding more comments here. I am certainly glad to respond to questions, but I didn't want to bother those who might not want to keep getting updates about this part of the thread :)


In the process of writing a long response, I think I overlooked that the original question was about RSEM FPKM (rather than unique-read FPKM calculated from the raw read counts). That probably matters to some extent (and I apologize for the confusion), but the plots above at least show that the concept can be applied to CPM values (which are probably from unique counts).

I didn't use this same strategy, but there was a paper related to the "rounding factors" for log2(FPKM + X) transformation (where I am saying X could be 0.1 or 1), and an associated blog post where comments can be added:

http://bioinfo.aizeonpublishers.net/content/2013/6/bioinfo285-292.pdf

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

Also, you can end up over-correcting expression if you focus too much on the boxplots / density distributions. I think some sort of higher rounding factor probably should have been used (to avoid values like -10), but, with the possible exceptions of H12 and C3, it wasn't immediately obvious to me that the differences in the first boxplot (from the user's question, rather than one of the DESeq2 demo datasets used for the variance-stabilization plots) represented an unacceptable amount of variability.
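For reference, the values near -10 come straight from the small pseudocount: a gene with zero FPKM ends up at log2(0 + 0.001), whereas a rounding factor of 1 floors it at 0.

log2(0 + 0.001)   # about -9.97, i.e. the "-10" floor with a 0.001 pseudocount
log2(0 + 1)       # 0, the floor with a pseudocount of 1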

6.5 years ago

Hi Silvia,

I would definitely recommend the regularised log counts for use with your downstream machine learning techniques, such as Random Forest. The distribution of your logged FPKM counts does not appear to be correct, and this may therefore bias the Random Forest results.
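As a rough sketch of that workflow (assuming dds is your DESeq2 object and sample_labels is a placeholder for your class labels; randomForest expects samples in rows, so the rlog matrix is transposed):

library(DESeq2)
library(randomForest)

rld <- rlog(dds, blind = TRUE)           # regularised log counts
rf <- randomForest(x = t(assay(rld)),    # samples in rows, genes in columns
                   y = factor(sample_labels),
                   importance = TRUE)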

One extra thing to be careful about: some machine learning functions may automatically log-transform your data internally, so be sure that no extra transformations are applied without your knowledge.

Kevin

6.5 years ago
igor 13k

The FPKM values are only normalized within each sample. The DESeq2 rlog values are normalized across all samples. Thus, the rlog values are more appropriate for comparing the different samples.

You could argue that the DESeq2 normalized counts are even better because they have not been adjusted to reduce variance.
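For reference, both can be pulled from the same DESeq2 object (a minimal sketch, assuming dds has already been built from the count matrix):

dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)   # median-of-ratios normalised counts
rld <- rlog(dds)                                # rlog-transformed values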
