Log transformation of RPKM
5.5 years ago
elb ▴ 250

Hi guys, I have a question about how to define a gene as expressed in an RNA-Seq analysis. Is it better to use log2(RPKM) or untransformed RPKM? I know that the log is only a transformation, so a given cut-off applied to log2(RPKM) will call a gene expressed only at a higher RPKM than the same cut-off applied to untransformed RPKM. However, I think that using untransformed RPKM alone leads to calling genes expressed that have too few reads.

Any ideas?

Thank you in advance

RNA-Seq • 8.3k views
5.5 years ago

If you just want to filter out genes of low expression prior to performing downstream analyses, then just go by the RPKM values and eliminate genes with mean RPKM < 10 or < 20, and/or those genes with a high frequency of 0 or NA values.
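
A minimal sketch of that kind of filter, in plain Python. The function name `keep_gene`, the `max_missing_frac=0.5` default (one possible reading of "high frequency") and the toy values are all assumptions for illustration, not part of any package:

```python
import math

def keep_gene(values, min_mean=10.0, max_missing_frac=0.5):
    """Return True if a gene passes the low-expression filter.

    values: per-sample RPKM for one gene; None or NaN marks a missing value.
    min_mean and max_missing_frac are illustrative defaults, not fixed rules.
    """
    # Values that were actually measured (neither None nor NaN).
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if not observed:
        return False
    # Count samples that are missing or exactly zero.
    n_bad = sum(1 for v in values if v is None or math.isnan(v) or v == 0)
    mean_rpkm = sum(observed) / len(observed)
    return mean_rpkm >= min_mean and n_bad / len(values) <= max_missing_frac

# Toy RPKM values (made up), genes as keys and samples as list entries.
rpkm = {
    "geneA": [25.0, 30.5, 18.2, 22.1],
    "geneB": [0.0, 0.0, 1.2, 0.0],
    "geneC": [12.0, None, 15.0, 11.0],
}
filtered = {gene: vals for gene, vals in rpkm.items() if keep_gene(vals)}
print(sorted(filtered))
```

Here the mean is taken over the non-missing samples only; whether a missing value should instead count as zero is a choice to make for your own data.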

If your aim, however, is to define, for example, an expression signature for a particular tissue or condition, then transform your RPKM dataset to Z-scores via the zFPKM package (R programming language) and then set cut-offs:

  • Z > 2 = expressed
  • Z < -2 = not expressed

Those cut-offs can, of course, be modified.
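
To make the cut-off logic concrete, here is a rough sketch in Python. It is NOT the zFPKM package's algorithm (that package has its own fitting procedure, which this does not reproduce); it is a plain Z-score of log2-transformed values within one sample. The function name `classify_by_z`, the pseudocount of 0.1 and the "undetermined" label for the middle range are assumptions for illustration:

```python
import math
import statistics

def classify_by_z(rpkm_by_gene, pseudocount=0.1, upper=2.0, lower=-2.0):
    """Plain Z-score of log2(RPKM + pseudocount) across the genes of one sample.

    Returns {gene: (z, label)}. This is a simple stand-in for illustration,
    not the zFPKM transformation itself.
    """
    genes = list(rpkm_by_gene)
    log_vals = [math.log2(rpkm_by_gene[g] + pseudocount) for g in genes]
    mu = statistics.mean(log_vals)
    sd = statistics.stdev(log_vals)
    if sd == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    calls = {}
    for gene, x in zip(genes, log_vals):
        z = (x - mu) / sd
        if z > upper:
            label = "expressed"
        elif z < lower:
            label = "not expressed"
        else:
            label = "undetermined"
        calls[gene] = (z, label)
    return calls

# Toy single-sample RPKM values (made up).
sample = {"geneA": 150.0, "geneB": 0.0, "geneC": 12.0, "geneD": 3.5, "geneE": 40.0}
for gene, (z, label) in classify_by_z(sample).items():
    print(gene, round(z, 2), label)
```

Passing different `upper` and `lower` values is how the cut-offs would be modified in this sketch.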

Note that neither RPKM, FPKM, nor CPM data is suitable for differential expression analysis, and neither are the logged transformations of these.

Kevin


Thank you Kevin! You help me a lot every time! Thank you again!


While I have seen zFPKM used in at least one paper, I didn't realize that there was a zFPKM Bioconductor package. So, I am glad that Kevin pointed that out. However, that package says "Reference recommends using zFPKM > -3 to select expressed genes".

This alternative suggestion in the Bioconductor package is closer to what I would expect, if using FPKM of 0.1 as a rough approximation for expressed genes (or at least a rounding threshold to place less emphasis on high fold-change values in genes with low expression / counts). If prioritizing candidates for validation / future study, I might focus more on those with FPKM > 1 (possibly within a functionally relevant category), but I would expect FPKM = 1 to be closer to the mean among all genes. So, I am not sure that the standard use of |z-score| > 2 is necessarily the best strategy for defining expressed genes (there are probably a lot of expressed genes with 0 < Z < 2, for example), unless you have a separate category for your baseline (such as using normal expression for the z-score, and using disease samples to test for differences; although that could also be done with a more typical differential expression test).


Thanks for the input Charles. Yes, the thresholds can of course be modified - there will be a lot of factors going into this. There are also many other ways to identify genes that are representative of a tissue/cell.


That is a good point that I didn't previously notice - if you have a panel of cell/tissue types, I could see how a z-score per-gene could be useful for identifying cell/tissue specific markers (that could also be true per-sample, which is what I thought was being asked about, but I don't believe I've tried that before and the disease-normal z-score that I mentioned would also be per-gene rather than per-sample).

5.5 years ago

If you plot log2(FPKM+0.1) values, you will usually see two distributions (unless you are looking only at something like lncRNA annotations or miRNA-Seq). While I don't typically think of this in terms of expressed and unexpressed, you could potentially choose a threshold where the density is minimal between those two peaks. In practice, if I need to narrow down candidates, I would typically increase the threshold for average for 50% FPKM (rather than call the gene "expressed" or "not expressed").
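
One possible way to pick that between-peaks threshold, sketched in Python with a simple histogram instead of a smoothed density. The function name `valley_threshold`, the 30-bin default and the "deepest valley" heuristic are my assumptions here, and the toy data are synthetic (two log-normal clusters), not real FPKM values:

```python
import math
from statistics import NormalDist

def valley_threshold(fpkm_values, pseudocount=0.1, n_bins=30):
    """Return the log2-scale centre of the deepest histogram valley, or None.

    A valley's depth is how far its bin count sits below the lower of the
    tallest bin on its left and the tallest bin on its right.
    """
    logs = [math.log2(v + pseudocount) for v in fpkm_values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return None  # a single value gives no valley
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in logs:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1

    best_bin, best_depth = None, 0
    for i in range(1, n_bins - 1):
        depth = min(max(counts[:i]), max(counts[i + 1:])) - counts[i]
        if depth > best_depth:  # ties keep the left-most bin
            best_bin, best_depth = i, depth
    if best_bin is None:
        return None  # no bin sits below peaks on both sides
    return lo + (best_bin + 0.5) * width

# Synthetic bimodal data: evenly spaced quantiles of two normal
# distributions on the log2 scale, converted back to FPKM.
quantiles = [(i + 0.5) / 500 for i in range(500)]
low, high = NormalDist(-3, 1), NormalDist(4, 1)
toy_fpkm = [2 ** low.inv_cdf(q) for q in quantiles]
toy_fpkm += [2 ** high.inv_cdf(q) for q in quantiles]

threshold = valley_threshold(toy_fpkm)
print("log2 threshold:", threshold)
print("FPKM threshold:", 2 ** threshold - pseudocount if (pseudocount := 0.1) else None)
```

With noisy real data the bin count matters quite a bit, so it would be worth checking the chosen threshold against the actual density plot rather than trusting the number blindly.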

While I occasionally see value in performing a standard differential expression test for log2(FPKM+0.1) values (which is usually more conservative, if you are trying to narrow down candidates), I would usually use some combination of p-values calculated from counts with an independent assessment using the log2(FPKM+0.1) values for visualization. So, I think Kevin is right that there is value in using the counts, but it isn't quite correct to say "RPKM, FPKM, nor CPM data is suitable for differential expression analysis, and neither are the logged transformations of these" (even though I have seen lots of people say that, at least for RPKM/FPKM).
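
For reference, this is how the log2(FPKM+0.1) values relate to the underlying counts, as a small Python sketch using the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments). The function names and the example numbers are made up for illustration:

```python
import math

def fpkm(count, gene_length_bp, total_mapped):
    """FPKM = count * 1e9 / (gene length in bp * total mapped fragments)."""
    return count * 1e9 / (gene_length_bp * total_mapped)

def log2_fpkm(count, gene_length_bp, total_mapped, pseudocount=0.1):
    """log2(FPKM + pseudocount); the pseudocount keeps zero counts finite."""
    return math.log2(fpkm(count, gene_length_bp, total_mapped) + pseudocount)

# Hypothetical gene: 500 fragments, 2,000 bp long, 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
example_log = log2_fpkm(500, 2000, 20_000_000)
zero_log = log2_fpkm(0, 2000, 20_000_000)  # equals log2(0.1)
print(example_fpkm, example_log, zero_log)
```

Because the length and library-size scaling are applied before the log, the counts themselves (which count-based tests rely on) cannot be recovered from the log2(FPKM+0.1) values alone.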
