Question: log transformation of rpkm
elb (Torino) wrote 11 months ago:

Hi guys, I have a question about how to define a gene as expressed in RNA-Seq analysis. Is it better to use log2(RPKM) or untransformed RPKM to decide whether a gene is expressed? I know that the log is only a transformation, so a given numeric cutoff applied to log2(RPKM) corresponds to a higher RPKM than the same cutoff applied to raw RPKM. However, I think that using raw RPKM alone leads to calling genes expressed that have too few reads.

Any ideas?

Thank you in advance.

rna-seq • 939 views
Kevin Blighe wrote 11 months ago:

If you just want to filter out lowly expressed genes before performing downstream analyses, then go by the RPKM values and eliminate genes with mean RPKM < 10 or < 20, and/or genes with a high frequency of 0 or NA values.
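
As a rough illustration of that kind of filter, here is a minimal Python sketch. The function name, the `max_zero_frac=0.5` cut-off for "high frequency of 0 or NA", and the toy values are all assumptions made for the example; only the mean RPKM < 10 cut-off comes from the answer above.

```python
from statistics import mean

def filter_low_expression(rpkm, min_mean=10.0, max_zero_frac=0.5):
    """Keep genes whose mean RPKM reaches min_mean and that are not mostly 0/NA.

    rpkm maps gene name -> list of per-sample RPKM values (None marks NA).
    max_zero_frac is an assumed definition of "high frequency of 0 or NA".
    """
    kept = {}
    for gene, values in rpkm.items():
        observed = [v for v in values if v is not None]
        # Fraction of samples that are missing or exactly zero.
        n_zero_or_na = sum(1 for v in values if v is None or v == 0)
        zero_frac = n_zero_or_na / len(values)
        gene_mean = mean(observed) if observed else 0.0
        if gene_mean >= min_mean and zero_frac <= max_zero_frac:
            kept[gene] = values
    return kept

# Toy values for illustration only (not real data).
example = {
    "geneA": [35.0, 42.0, 28.0, 31.0],  # consistently expressed
    "geneB": [0.0, 0.0, None, 80.0],    # decent mean, but mostly 0/NA
    "geneC": [2.0, 3.5, 1.0, 0.5],      # low mean RPKM
}
kept_genes = sorted(filter_low_expression(example))
print(kept_genes)
```

Here a gene is dropped if it fails either criterion; applying only one of the two criteria is an equally reasonable reading of "and/or".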

If your aim, however, is to define, for example, an expression signature for a particular tissue or condition, then transform your RPKM dataset to Z-scores with the zFPKM package (R/Bioconductor) and then set cut-offs:

  • Z > 2 = expressed
  • Z < -2 = not expressed

Those cut-offs can, of course, be modified.
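
To make the cut-off logic concrete, here is a small Python sketch. Note that it is not the zFPKM algorithm: it uses a plain standardization of log2(RPKM + 0.1) within one sample as a stand-in, and the function name, the pseudocount, the "undetermined" label, and the toy values are all assumptions for illustration.

```python
import math
from statistics import mean, stdev

def zscore_calls(rpkm_by_gene, upper=2.0, lower=-2.0, pseudocount=0.1):
    """Classify genes in one sample from z-scores of log2(RPKM + pseudocount).

    Plain standardization, used here as a stand-in for zFPKM.
    """
    logs = {g: math.log2(v + pseudocount) for g, v in rpkm_by_gene.items()}
    mu = mean(logs.values())
    sd = stdev(logs.values())
    calls = {}
    for gene, value in logs.items():
        z = (value - mu) / sd
        if z > upper:
            calls[gene] = "expressed"
        elif z < lower:
            calls[gene] = "not expressed"
        else:
            calls[gene] = "undetermined"
    return calls

# Toy sample for illustration only: nine modest genes and one very high one.
sample = {f"g{i}": v for i, v in enumerate([5, 8, 6, 7, 5, 9, 6, 8, 7], start=1)}
sample["high"] = 4000
calls = zscore_calls(sample)
print(calls["high"], calls["g1"])
```

The `upper` and `lower` arguments are where the cut-offs would be changed.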

Note that neither RPKM, FPKM, nor CPM data is suitable for differential expression analysis, and neither are the logged transformations of these.

Kevin


Thank you Kevin! You help me a lot every time! Thank you again!

Reply written 11 months ago by elb

While I have seen zFPKM used in at least one paper, I didn't realize that there was a zFPKM Bioconductor package. So, I am glad that Kevin pointed that out. However, that package says "Reference recommends using zFPKM > -3 to select expressed genes".

This alternative suggestion in the Bioconductor package is closer to what I would expect, if using FPKM of 0.1 as a rough approximation for expressed genes (or at least a rounding threshold to place less emphasis on high fold-change values in genes with low expression / counts). If prioritizing candidates for validation / future study, I might focus more on those with FPKM > 1 (possibly within a functionally relevant category), but I would expect FPKM = 1 to be closer to the mean among all genes. So, I am not sure that the standard use of |z-score| > 2 is necessarily the best strategy for defining expressed genes (there are probably a lot of expressed genes with 0 < Z < 2, for example), unless you have a separate category for your baseline (such as using normal expression for the z-score, and using disease samples to test for differences; although that could also be done with a more typical differential expression test).
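
The "rounding threshold" effect of adding 0.1 can be seen in a short Python sketch. The function name and the FPKM values are made up for illustration; the point is only that the same 8-fold ratio yields a much smaller log2 difference when both values sit below the pseudocount.

```python
import math

def log2_fold_change(fpkm_a, fpkm_b, pseudocount=0.1):
    """Difference in log2(FPKM + pseudocount) between two conditions."""
    return math.log2(fpkm_b + pseudocount) - math.log2(fpkm_a + pseudocount)

# Illustrative values only: the same 8-fold ratio at two expression levels.
low = log2_fold_change(0.01, 0.08)   # both values well below 0.1
high = log2_fold_change(10.0, 80.0)  # both values well above 0.1

print(f"low-expression gene:  {low:.2f}")
print(f"high-expression gene: {high:.2f}")
```

For the low-expression gene the difference is well under 1, while for the high-expression gene it stays close to the raw log2(8) = 3.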

Reply written 11 months ago by Charles Warden (modified 11 months ago)

Thanks for the input Charles. Yes, the thresholds can of course be modified - there will be a lot of factors going into this. There are also many other ways to identify genes that are representative of a tissue/cell.

Reply written 11 months ago by Kevin Blighe

That is a good point that I hadn't noticed. If you have a panel of cell/tissue types, I can see how a per-gene z-score could be useful for identifying cell/tissue-specific markers. That could also be true per sample, which is what I thought was being asked about, but I don't believe I've tried that before, and the disease-normal z-score that I mentioned would also be per gene rather than per sample.

Reply written 11 months ago by Charles Warden
Charles Warden (Duarte, CA) wrote 11 months ago:

If you plot log2(FPKM+0.1) values, you will usually see two distributions (unless you are only looking at something like lncRNA annotations or miRNA-Seq). While I don't typically think of this in terms of expressed and unexpressed, you could potentially choose a threshold where the density is minimal between those two peaks. In practice, if I need to narrow down candidates, I would typically increase the threshold for average for 50% FPKM (rather than call the gene "expressed" or "not expressed").
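
One possible way to pick the point of minimal density between the two peaks is sketched below in Python. This is a simple histogram heuristic, not an established method: the function name, the bin count, and the synthetic two-component data (built from normal quantiles so the example is deterministic) are all assumptions. Real input would be `[math.log2(f + 0.1) for f in fpkm]`.

```python
from statistics import NormalDist

def valley_threshold(log_values, n_bins=40):
    """Return the centre of the sparsest histogram bin between the two main peaks.

    log_values are already log2(FPKM + 0.1). Returns None if no second peak is found.
    """
    lo, hi = min(log_values), max(log_values)
    if hi == lo:
        return None
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in log_values:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1

    main_peak = max(range(n_bins), key=counts.__getitem__)

    # The second peak is the bin that rises highest above the lowest bin
    # separating it from the main peak; that lowest bin is the valley.
    best_gap, valley = 0, None
    for j in range(n_bins):
        left, right = sorted((main_peak, j))
        low_bin = min(range(left, right + 1), key=counts.__getitem__)
        gap = counts[j] - counts[low_bin]
        if gap > best_gap:
            best_gap, valley = gap, low_bin
    if valley is None:
        return None
    return lo + (valley + 0.5) * width

# Synthetic log2(FPKM + 0.1) values for illustration only:
# a "background" component near -2.5 and an "expressed" component near 3.
background = [NormalDist(-2.5, 0.6).inv_cdf((i + 0.5) / 400) for i in range(400)]
expressed = [NormalDist(3.0, 1.5).inv_cdf((i + 0.5) / 600) for i in range(600)]
threshold = valley_threshold(background + expressed)
print(f"valley at log2(FPKM + 0.1) = {threshold:.2f}")
```

On noisy real data a smoothed density estimate would probably be more stable than raw histogram bins, and a unimodal distribution can still produce a spurious "valley", so it is worth checking the plot by eye.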

While I occasionally see value in performing a standard differential expression test for log2(FPKM+0.1) values (which is usually more conservative, if you are trying to narrow down candidates), I would usually use some combination of p-values calculated from counts with an independent assessment using the log2(FPKM+0.1) values for visualization. So, I think Kevin is right that there is value in using the counts, but it isn't quite correct to say "RPKM, FPKM, nor CPM data is suitable for differential expression analysis, and neither are the logged transformations of these" (even though I have seen lots of people say that, at least for RPKM/FPKM).
