Question: Metric to use for RNAseq expression matrix
3
gravatar for RamRS
6 weeks ago by
RamRS24k
Houston, TX
RamRS24k wrote:

Hi,

TL;DR: What is the best metric for an expression matrix with RNAseq data obtained in multiple batches?

Background

I'm working on a project where we're compiling RNAseq data on a bunch of human samples. The data has been accumulating for the past few years and will continue to come in. The data is from different sequencing platforms (HiSeq/NovaSeq) and has other differences (such as the fact that some of them were multiplexed while others were not, essentially giving me anywhere between 2-16 PE FASTQ files per sample). The raw data is also multi-species (host+graft) and gets separated before being quantified with STAR+RSEM.

Objective

My question here is: How do I create one matrix that allows cross-sample comparison of gene expression?

Challenges

First off, FPKM and RPKM are off the table because they don't allow for comparison across samples. TPM is an option as it does allow some sort of comparison across samples, although there exists some debate. Quantile/Upper Quartile normalization is another option.

Then of course is the issue of batch effects. An option to deal with that is sva::ComBat(), which "removes" batch effects and gives me a batch-effect-subtracted matrix. However, it does seem to affect my data weirdly, producing negative values for count data, which doesn't make any sense. Because there is no DE question, using DESeq2 doesn't make a lot of sense to me (unless DESeq2 can somehow export a TPM or count matrix with batch effect corrected)

Discussion

What does the community think is the best way to go about doing this? What is the metric to use (TPM / counts) and what sort of normalization (Quantile Normalization / UQ Normalization / something like rlog - however that would apply)? And how do I account for the batch effect? An ideal approach would allow me to add a new set of samples without altering existing samples' values, but that might be asking for too much.

Afterthought

I'd settle on UQ-normalized counts "because TCGA does it", but that doesn't seem a satisfactory reason to adopt that approach, especially given the fact that UQ-normalization does not really help with batch effect (why are TCGA even using it?)

ADD COMMENTlink modified 6 weeks ago by i.sudbery5.3k • written 6 weeks ago by RamRS24k

I haven't read this in a while, but the article by Gilad & Mizrahi-Man discussing batch effects of the modENCODE datasets may contain some relevant approaches

ADD REPLYlink written 6 weeks ago by Friederike5.1k
3
gravatar for i.sudbery
6 weeks ago by
i.sudbery5.3k
Sheffield, UK
i.sudbery5.3k wrote:

In the past we have used rlog from DESeq2 and then removed batch effects with removeBatchEffects from limma.

This will give you a matrix than can be used for many things, particularly visualization and clustering. However for many specific analyses there are probably better ways (which may be specific to the specific analysis).

ADD COMMENTlink written 6 weeks ago by i.sudbery5.3k

Hi i.sudbey,

is there any difference by using removeBatchEffect from limma or using the batch option in the design of DESeq2? also, the only purpose of rlog is the visualization like clustering but they can't be used for comparing gene expression profiling isn't it?

Thank you

ADD REPLYlink written 6 weeks ago by Morris_Chair120
2

You can't use batch in the design of DESeq2 here because no differential expression analysis is happening and so there is no place for the variance due to batch to be accounted for.

Rlog can't be used (or at least is very suboptimal for use in ) differential expression analysis, however, I don't believe this is what @RamRS wants to do. rlog is also bad for comparing two genes from the same sample.

There are many other analyses, where one can imagine algorithms being designed that are based on Negative Bionomial assumptions, but often things are not, they are based on Gaussian assumptions. In which case rlog would probably perform better. It'll definitely perform better than simple log2(counts+1).

ADD REPLYlink written 6 weeks ago by i.sudbery5.3k

rlog is also bad for comparing two genes from the same sample

Wouldn't this mean that comparing a set of genes across samples won't work with rlog? The 2D matrix of genes by samples would not be comparable along one of its dimensions, right?

ADD REPLYlink written 6 weeks ago by RamRS24k

Depends how the comparison is done. rlog suffers from the same problem counts do here: that it is affected by the length of the gene, the %GC of the gene etc etc. When comparing the same gene between samples, these things are accounted for, but if you want to compare genes within a sample these gene specific confounders will have an effect. I'm not really aware of any measure that doesn't have this problem. The best are probably FPKM and TPM - but obviously those have there own problems, especially FPKM, which is essentially incomparable between samples. But even then, there only account for length, not %GC, splicing complexity, mappability etc.

ADD REPLYlink written 6 weeks ago by i.sudbery5.3k

I'm OK with just accounting for length. Does it make sense to do TPM + removeBatchEffect instead of counts + removeBatchEffect? Or even rlog(TPM) > removeBatchEffect? All of this of course only helps if removeBatchEffect does not do what ComBat does (over-correction resulting in negative count/TPM values).

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by RamRS24k

There are two problems with TPM -> removeBatchEffect:

  1. Normalisation is by library size, rather than DESeq normalisation, so you could have compositional problems.
  2. it doesn't account for the extra variance at low read numbers. I have considered rlog(TPM) before, or alternatively TPM(rlog), but I've never got a good answer from anyone on if either of these things are valid.

In a genes * samples matrix, column clusterting will work fine on rlog/vst. Row clustering will also be fine if the data is centred/row normed, or a correlative distance measure is used.

As for removeBatchEffects over correcting, I don't know, because we've always done it on a log-scale measure, and so negative values are fine.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by i.sudbery5.3k

This is indeed a dilemma - capturing differences between gene lengths/composition metrics is not possible with counts/rlog. I do need to be able to pick "zero-expression genes" in the matrix, which is pretty simple if count is 0, but I don't know how those can be done after an rlog + removeBatchEffect - given that removeBatchEffect could transform those 0s to anything based on the batch they're from.

ADD REPLYlink written 6 weeks ago by RamRS24k

I think you'd probably find that rlog also makes it hard to pick out 0 genes, because, if I understand correctly, part of the theory behind variance homogenising transformations like rlog and VST, is that because variance is proportionally higher at low counts, and because the discretness of counts gets in the way of accurate quantification, you can't really trust a zero to be zero, or rather, the difference between no counts and 1 count is not particularly meaningful.

ADD REPLYlink written 6 weeks ago by i.sudbery5.3k

From what I understand this is especially prominent in rlog as it shrinks the genes with high variability but low counts towards zero, which is not the case with vst.

ADD REPLYlink written 6 weeks ago by ATpoint23k

Visualization and clustering are the principal drivers for this project, so this might just work. Thank you!

ADD REPLYlink written 6 weeks ago by RamRS24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 692 users visited in the last hour