Question: Comparing FPKM values in different genes
snp87 wrote (11 months ago):

Hi all,

I am new to RNA-seq analysis and I wanted to ask about comparing expression in different genes. I apologise in advance that this question is quite basic. I used Cuffdiff to perform a differential expression analysis and validated several of the candidate genes it identified using in situ hybridisation. While the validation of most genes confirmed the expected differential expression, it was difficult to decide on the cut-off below which no expression could be expected. For instance, gene A had an FPKM of 1000 in cell1 and an FPKM of 90 in cell2; validation showed expression in cell1 but not cell2. However, gene B had an FPKM of 80 in cell1 and an FPKM of 3 in cell2, and validation showed expression in both cell1 and cell2, though it was stronger in cell1.

Since FPKM is normalised for gene length, I assumed that the FPKM values of different genes should be comparable. Am I wrong in thinking this? And, other than the sensitivity of the probe in detecting the gene, could there be any reason that in situ hybridisation shows no expression for some genes when the transcriptomic data say there are transcripts?
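To make the gene-length point concrete, here is a minimal Python sketch of the usual FPKM calculation (fragments per kilobase of transcript per million mapped fragments). The function name and all numbers are made up for illustration; they are not taken from the data set discussed in this thread.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    per_kilobase = fragments / (gene_length_bp / 1_000)
    return per_kilobase / (total_mapped_fragments / 1_000_000)


# Hypothetical genes in the same (hypothetical) library of 20 million fragments.
short_gene = fpkm(fragments=500, gene_length_bp=1_000,
                  total_mapped_fragments=20_000_000)
long_gene = fpkm(fragments=2_000, gene_length_bp=4_000,
                 total_mapped_fragments=20_000_000)

# The long gene has four times the fragments but also four times the length,
# so both genes end up with the same FPKM.
print(short_gene, long_gene)
```

Within one sample, the length correction is what makes two genes roughly comparable. Whether a given value corresponds to a detectable signal in another assay is a separate question.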

Thanks so much!


Thanks so much for your reply, Kevin. You make some valid points. Just a few clarifications, though. My purpose for the RNA-seq was to perform a differential expression analysis between 2 groups of closely related cells. While that was the main aim, now that I have the dataset, I want to see what method might be best to predict whether or not a gene is expressed in the transcriptome. I also generated HTSeq counts and analysed the data using DESeq2, and I noticed the same issues with this pipeline. Since the count data does not take into consideration the gene length or the sequencing depth, which differed between the samples, I thought it was easier to compare the expression of different genes in the same sample based on the FPKM (but I guess that, given the issues with the normalisation used to calculate FPKM, this is not accurate).

Relating my question to the count matrix generated by HTSeq and the DESeq2 analysis: how can you decide what number of counts can be assumed to represent negligible expression (not biologically relevant)? Two genes, X and Y, are each approximately 2000 bp long. Gene X had counts of 5-54 in one sample (with 5 replicates) and gene Y had counts of 20-100 in the same sample (with 5 replicates). Gene X was expressed when validated with in situ hybridisation while gene Y was not expressed. Do you think this is more related to the sensitivity of RNA-seq vs in situ hybridisation for detecting the genes, or does it point to a problem with my data set? The replicates were multiplexed and sequenced in the same run, but each replicate had a different sequencing depth (which I've read is quite common). Also, just to mention, I am working with low quantities of RNA (RIN > 8): 2 ng (2000 pg) of RNA was used for cDNA synthesis and amplification and subsequent library synthesis.

Thanks so much!

written 11 months ago by snp87

I see, your aim is simply to determine which genes are expressed and which are not. Given that you have FPKM, you could transform the data to the Z scale using the zFPKM function in R, which has been received positively from what I have seen so far. On the Z scale, you then have a more intuitive way of gauging expressed / non-expressed, because it may be as simple as:

  • Z-score = 0: expressed
  • Z-score > 3: highly expressed
  • Z-score < -3: not expressed
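As a rough illustration of the idea, the Python sketch below computes plain z-scores of log2(FPKM) within one sample and applies the cut-offs listed above. This is a deliberate simplification and not a reimplementation of the zFPKM package, which, as far as I understand it, fits the Gaussian to the log2(FPKM) distribution in a more careful way. The function names and gene values are hypothetical, and treating everything between -3 and 3 as "expressed" is my own assumption.

```python
import math
import statistics


def log2_fpkm_zscores(fpkm_by_gene):
    """Plain z-scores of log2(FPKM) within one sample.

    Genes with FPKM of zero are skipped because log2(0) is undefined.
    """
    logged = {g: math.log2(v) for g, v in fpkm_by_gene.items() if v > 0}
    mean = statistics.mean(logged.values())
    sd = statistics.stdev(logged.values())
    return {g: (x - mean) / sd for g, x in logged.items()}


def classify(z):
    """Apply the cut-offs from the list above (middle band is an assumption)."""
    if z < -3:
        return "not expressed"
    if z > 3:
        return "highly expressed"
    return "expressed"


# Hypothetical FPKM values for three genes in one sample.
z = log2_fpkm_zscores({"geneA": 1.0, "geneB": 4.0, "geneC": 16.0})
for gene, score in z.items():
    print(gene, round(score, 2), classify(score))
```

With real data the z-scores would be computed over all genes in a sample, not three, so that the mean and standard deviation are meaningful.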

Coincidentally, it is through this logic that some have been developing cellular deconvolution methods from RNA-seq data.

modified 11 months ago • written 11 months ago by Kevin Blighe

Thanks so much for the suggestion. I will try that.

written 11 months ago by snp87
Kevin Blighe wrote (11 months ago):

Hey,

RPKM / FPKM are not ideal when cross-sample differential expression analysis is your aim; indeed, they leave samples incomparable for the purposes of differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Cuffdiff is also old. HISAT2 / StringTie are the upgrades to the older TopHat / Cufflinks pipeline.

------------------------------------------

For instance, gene A had an FPKM of 1000 in cell1 and an FPKM of 90 in cell2; validation showed expression in cell1 but not cell2. However, gene B had an FPKM of 80 in cell1 and an FPKM of 3 in cell2, and validation showed expression in both cell1 and cell2, though it was stronger in cell1.

This is exactly the consequence of the normalisation process that produces FPKM values: a value of 80 in one sample may mean something entirely different from 80 in another sample, because each sample is scaled by its own total. In extreme cases, 80 could mean very high expression in one sample but virtually nil in the other. However, the statistical tests cannot make this distinction. This also has a direct consequence when setting minimal thresholds, i.e., for expressed / not expressed.
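One way to see this is to rescale each sample's FPKM values so that they sum to one million (TPM), which shows what share of the sample a given FPKM value represents. The Python sketch below uses two invented samples in which the same gene has FPKM 80 but the FPKM totals differ. The sample contents and the function name are hypothetical, and TPM is used here only to expose the per-sample scaling, not as a fix for between-sample comparison.

```python
def tpm_from_fpkm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_by_gene.values())
    return {gene: value * 1e6 / total for gene, value in fpkm_by_gene.items()}


# Hypothetical samples: geneB has FPKM 80 in both, but the totals differ
# (1000 in sample1, 4000 in sample2).
sample1 = {"geneB": 80.0, "all_other_genes": 920.0}
sample2 = {"geneB": 80.0, "all_other_genes": 3920.0}

tpm1 = tpm_from_fpkm(sample1)
tpm2 = tpm_from_fpkm(sample2)

# Same FPKM, but a four-fold different share of each sample's total.
print(tpm1["geneB"], tpm2["geneB"])
```

A single FPKM threshold for "expressed" would therefore treat these two cases identically even though the gene's share of the sample differs four-fold.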

Unfortunately, FPKM data is still widely used and appears in publications, which is an argument always used to defend its usage by those who are unaware of its pitfalls.

----------------------------------------------

If you have RNA-seq data, then please use a better tool for differential expression analysis, like DESeq2, edgeR, or limma-voom.
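For a sense of what between-sample normalisation on raw counts involves, the Python sketch below illustrates the median-of-ratios idea that DESeq2 uses for its size factors. It is a sketch of the general technique only, not the DESeq2 implementation: the function name, the input layout and the counts are all assumptions made for illustration.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts, with genes in
    the same order in every sample. Genes with a zero count in any sample
    are skipped, since their geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for i in range(n_genes):
        values = [counts[s][i] for s in samples]
        if min(values) == 0:
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][i]) - log_geo_mean)
    return {s: math.exp(statistics.median(log_ratios[s])) for s in samples}


# Hypothetical replicates: rep2 was sequenced twice as deeply as rep1.
raw = {"rep1": [10, 100, 1000, 0], "rep2": [20, 200, 2000, 5]}
sf = size_factors(raw)
normalised = {s: [c / sf[s] for c in raw[s]] for s in raw}
print(sf)
```

Dividing each sample's counts by its size factor puts replicates of different sequencing depth on a common scale, which is the step that FPKM alone does not provide.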

Kevin

modified 6 months ago • written 11 months ago by Kevin Blighe