Question: Best-suited unit of expression to perform between-sample comparison
2.9 years ago by
leaodel130 wrote:

Dear biostars community,

I'm having some doubts regarding which measure is ideal for reporting expression values and I thought you could help me with your experience.

I've been working with RNA-seq data from two projects (one with single-end reads, the other with paired-end reads), and I want to choose an expression measure suitable for comparing samples from different experiments.

From the literature (I dug through a lot of blogs, papers, etc.), I've essentially summed up the following:

  1. Both RPKM and FPKM shouldn't be used anymore, since they contain an essentially arbitrary scaling factor that depends on the average effective length of the transcripts in the underlying sample. Not reproducible, not comparable...
  2. TPM seems more appropriate for dealing with this issue, since the normalized values sum to the same total in every sample, making it "more suitable" for comparing samples. However, its calculation (specifically the denominator term) is also sample dependent, and this would be the main reason why I shouldn't use it to directly compare expression values between samples.
  3. CPM seems to be a less-normalized measure, since it takes only library size into account. On the other hand, estimated read counts aren't normalized across samples at all, making them useless for my goal (unless I use some between-sample normalization method).
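To make the three definitions above concrete, here is a minimal Python sketch using the usual textbook formulas for CPM, RPKM and TPM. The gene lengths and counts are made-up toy values, and plain gene length stands in for effective length, which is a simplification.

```python
# Toy comparison of CPM, RPKM and TPM (all input numbers are hypothetical).

def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)  # this denominator is specific to each sample
    return [r * 1e6 / total_rate for r in rates]

lengths = [1000, 2000, 500]   # hypothetical gene lengths in bp
sample_a = [100, 400, 50]     # hypothetical raw counts
sample_b = [300, 100, 150]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name,
          "sum CPM =", round(sum(cpm(counts))),
          "sum RPKM =", round(sum(rpkm(counts, lengths))),
          "sum TPM =", round(sum(tpm(counts, lengths))))
```

With these toy numbers the CPM and TPM values sum to one million in both samples, while the RPKM totals differ from sample to sample, which is the sample-dependent scaling described in point 1. The TPM denominator (`total_rate`) is still computed per sample, as noted in point 2.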

My point is that TPM seems to be the most reliable expression measure for comparing different samples. Still, TPM only performs within-sample normalization (although there are a lot of papers comparing samples based on TPM values).

Do you think TPM is suitable for comparing expression values between samples? If not, which method would you recommend? Should I use a between-sample normalization method?

I'm looking forward to hearing your opinion! Thanks in advance!

modified 2.9 years ago by Carlo Yague5.7k • written 2.9 years ago by leaodel130
2.9 years ago by
Carlo Yague5.7k wrote:

TPM is not suitable for between-sample normalization because it doesn't account for differences in library composition. It is also very dependent on a few highly expressed genes that may not be the same between your samples.
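A toy illustration of the composition problem, assuming equal gene lengths and a fixed sequencing depth (all numbers are invented): four genes keep the same true abundance in both samples, while a fifth gene becomes six times more abundant in sample B.

```python
# Composition effect on TPM with hypothetical abundances and depth.

def tpm(counts, lengths_bp):
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]

def simulate_counts(abundances, depth):
    """Split a fixed number of reads among genes in proportion to abundance."""
    total = sum(abundances)
    return [depth * a / total for a in abundances]

lengths = [1000] * 5                                  # equal lengths, for simplicity
depth = 500                                           # same depth in both samples
counts_a = simulate_counts([1, 1, 1, 1, 1], depth)    # 100 reads per gene
counts_b = simulate_counts([1, 1, 1, 1, 6], depth)    # gene 4 dominates sample B

tpm_a = tpm(counts_a, lengths)
tpm_b = tpm(counts_b, lengths)

# Gene 0 has the same true abundance in both samples.
apparent_fold_change = tpm_b[0] / tpm_a[0]
print("TPM of gene 0 in A:", tpm_a[0])
print("TPM of gene 0 in B:", tpm_b[0])
print("apparent fold change:", apparent_fold_change)
```

Gene 0 did not change, yet its TPM in sample B is half of its TPM in sample A, because the single highly expressed gene takes a larger share of the fixed total.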

Instead, you could use TMM-normalized counts (as implemented in edgeR) or the median-of-ratios normalization used in DESeq2.
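For readers who want to see the idea behind the median-of-ratios approach, here is a simplified Python sketch. It is not DESeq2's code, only the basic recipe under stated assumptions: compute a per-gene geometric mean across samples, take each sample's ratio to it, and use the median ratio as that sample's size factor. The count matrix is invented, and TMM (a trimmed mean of log ratios) is not shown.

```python
import math
from statistics import median

def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix[g][s] is the raw count of gene g in sample s.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(count_matrix[0])
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    # log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(row[s]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical data: sample 1 is sequenced twice as deep as sample 0,
# and the last gene is additionally strongly up-regulated in sample 1.
counts = [[100, 200], [200, 400], [300, 600], [400, 800], [500, 10000]]

sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
naive_ratio = sum(r[1] for r in counts) / sum(r[0] for r in counts)

print("size factor ratio (sample 1 / sample 0):", sf[1] / sf[0])
print("total-count ratio:", naive_ratio)
print("normalized counts of gene 0:", normalized[0])
```

In this toy case the size factors recover the two-fold depth difference, whereas the ratio of total counts is pulled up to 8 by the one highly expressed gene. After dividing by the size factors, the unchanged genes have matching values in both samples.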

written 2.9 years ago by Carlo Yague5.7k

Could you not use DESeq2 to normalize the counts to library size before calculating the TPM to make it comparable across samples?

written 20 months ago by reilly.brian.m60

Yes, this is the median-of-ratios method I referred to (DESeq2 uses that).

written 20 months ago by Carlo Yague5.7k

What if I need something that accounts for both gene length and intersample differences? Could I apply a between-sample normalization, like TMM, to TPM values?

written 19 months ago by darklings200

In general, I think that if one is interested in intersample differences, a bias such as gene length does not matter, because it affects both samples similarly. Nevertheless, if you really need to correct for gene length, I would do the opposite of what you suggest: first do the between-sample normalization (like TMM or median of ratios), and only then scale the counts of every gene by dividing by gene length. By doing that, you keep your counts normalized (the median of ratios does not change). Note that I wouldn't use that metric for any kind of statistical analysis, as the counts for short genes will be proportionally inflated.
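The order of operations described above can be sketched in a few lines of Python. The size factors are taken as given (hypothetical values, as would come out of a between-sample normalization step), and the counts and gene lengths are invented; the division is by gene length, since that is the bias being corrected.

```python
# Step 1: between-sample normalization; step 2: divide by gene length.
# All numbers below are hypothetical.

raw_counts = {
    "sample_a": [100, 200, 300],
    "sample_b": [400, 800, 1200],
}
size_factors = {"sample_a": 0.5, "sample_b": 2.0}   # assumed already estimated
gene_lengths_bp = [500, 2000, 1000]

def normalize(counts, size_factor):
    """Divide raw counts by the sample's size factor."""
    return [c / size_factor for c in counts]

def per_kilobase(values, lengths_bp):
    """Divide each gene's value by its length in kilobases."""
    return [v / (l / 1000) for v, l in zip(values, lengths_bp)]

normalized = {s: normalize(c, size_factors[s]) for s, c in raw_counts.items()}
length_scaled = {s: per_kilobase(v, gene_lengths_bp) for s, v in normalized.items()}

# Between-sample ratios per gene, before and after the length scaling.
ratios_before = [b / a for a, b in zip(normalized["sample_a"], normalized["sample_b"])]
ratios_after = [b / a for a, b in zip(length_scaled["sample_a"], length_scaled["sample_b"])]

print("normalized:", normalized["sample_a"])
print("per kb:    ", length_scaled["sample_a"])
print("ratios before:", ratios_before, "after:", ratios_after)
```

The per-gene ratios between samples are identical before and after the length scaling, so the between-sample normalization is preserved. The scaling does reorder genes within a sample: the 500 bp gene goes from 200 to 400 while the 2000 bp gene goes from 400 to 200, which is the inflation of short genes mentioned above.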

modified 19 months ago • written 19 months ago by Carlo Yague5.7k

To obtain an expression matrix I normalized raw counts with the median ratio method from DESeq2 and then calculated counts per million with their fpm() function.

written 19 months ago by leaodel130

Shouldn't that be the cpm() function?

written 19 months ago by darklings200

Nope, the function is called fpm(), which stands for fragments per million. If you have counts and use this function, you'll end up with CPM.
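As a rough Python sketch of what a size-factor-based "per million" value looks like: as far as I understand the DESeq2 documentation, fpm() with robust = TRUE multiplies each size factor by the geometric mean of the column sums to get a robust library size, instead of using the raw column sums. The code below only mimics that idea; the counts and size factors are hypothetical, and the function name `robust_fpm` is made up for this example.

```python
import math

def plain_cpm(sample_counts):
    """Counts per million using the raw column sum as library size."""
    total = sum(sample_counts)
    return [1e6 * c / total for c in sample_counts]

def robust_fpm(counts_by_sample, size_factors):
    """Per-million values using size factor * geometric mean of column sums."""
    col_sums = [sum(col) for col in counts_by_sample]
    geo_mean_depth = math.exp(sum(math.log(s) for s in col_sums) / len(col_sums))
    result = []
    for col, sf in zip(counts_by_sample, size_factors):
        library_size = sf * geo_mean_depth
        result.append([1e6 * c / library_size for c in col])
    return result

# Hypothetical data: sample 1 is sequenced twice as deep, and its last gene
# is strongly up-regulated. Size factors are assumed to be already estimated.
counts_by_sample = [
    [100, 200, 300, 400, 500],
    [200, 400, 600, 800, 10000],
]
size_factors = [2 ** -0.5, 2 ** 0.5]

robust = robust_fpm(counts_by_sample, size_factors)
plain = [plain_cpm(col) for col in counts_by_sample]

print("gene 0, plain CPM: ", plain[0][0], plain[1][0])
print("gene 0, robust FPM:", robust[0][0], robust[1][0])
```

In this toy example the unchanged gene 0 gets the same value in both samples with the size-factor-based library size, while plain CPM gives it very different values because the one dominant gene inflates the column sum of sample 1.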

written 19 months ago by leaodel130

Hi Carlo,

I hope it's ok that I add my question to this post.

What normalization would you then recommend for doing the within-sample statistical analysis? I have TMM and gene length normalized counts and I am interested in comparing the counts between 3 genes within one sample. Can these only be visually/descriptively compared and not statistically?

Thanks, Morgan

written 7 weeks ago by mss40

You should use TPM values for within sample comparison.

written 6 weeks ago by spaladug10