Question

Quantile normalizing prior to or after TPM scaling?

6.8 years ago

mforde84 ★ 1.4k

A somewhat simple question: at what point in an RNA-seq analysis is it most advantageous to perform quantile normalization for later between-sample comparisons? Should it be at the raw count level, since the documentation on quantile normalization appears to require raw data? Does quantile normalization expect data distributed as a negative binomial, or does the distribution not matter if it simply works from a ranking of the (possibly log-scaled) values? Alternatively, some TCGA approaches quantile-normalize TPM values, though I thought this had some unwanted effect on the scaling of the TPM values? Also, which rows should be removed prior to normalization? E.g., rows with all zeros wouldn't affect the TPM calculation, but their effect on quantile normalization should grow as the number of such rows increases, right?

At the moment, I'm debating on two approaches:

raw counts -> quantile normalize -> TPM -> filter out low-TPM rows -> log2(TPM + 1) -> ComBat batch correction

compared to

raw counts -> TPM -> filter out low-TPM rows -> quantile normalize -> log2(quantile-normalized TPM + 1) -> ComBat batch correction
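To make the difference between the two orderings concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the toy counts and gene lengths are made up, `tpm` and `quantile_normalize` are hypothetical helper names, and filtering, the log transform and ComBat are left out. Ties within a sample are given the average of the reference values they span, which is one common convention and not necessarily what any particular package does.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample (gene lengths in kb assumed known)."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def quantile_normalize(samples):
    """Give every sample the same distribution: the mean of the sorted values.

    samples is a list of per-sample lists, one value per gene. Tied values
    within a sample receive the average of the reference values they span.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(s[k] for s in sorted_samples) / len(samples)
                 for k in range(n_genes)]
    result = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])
        normalized = [0.0] * n_genes
        start = 0
        while start < n_genes:
            end = start
            while end + 1 < n_genes and s[order[end + 1]] == s[order[start]]:
                end += 1
            shared = sum(reference[start:end + 1]) / (end - start + 1)
            for k in range(start, end + 1):
                normalized[order[k]] = shared
            start = end + 1
        result.append(normalized)
    return result


# Hypothetical toy data: 2 samples x 4 genes.
lengths_kb = [1.0, 2.0, 1.0, 2.0]
raw = [[10, 40, 30, 80], [100, 50, 400, 300]]

# Ordering 1: quantile normalize the counts, then convert to TPM.
qn_then_tpm = [tpm(s, lengths_kb) for s in quantile_normalize(raw)]

# Ordering 2: convert to TPM, then quantile normalize.
tpm_then_qn = quantile_normalize([tpm(s, lengths_kb) for s in raw])

for name, samples in [("QN -> TPM", qn_then_tpm), ("TPM -> QN", tpm_then_qn)]:
    print(name, [[round(v) for v in sorted(s)] for s in samples])
```

In this toy case the second ordering leaves both samples with identical sorted values, whereas in the first ordering the length scaling applied afterwards pulls the two distributions apart again.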

Any suggestions would be appreciated. M

RNAseq normalization • 9.7k views

ADD COMMENT • link updated 18 months ago by Kevin Blighe 88k • written 6.8 years ago by mforde84 ★ 1.4k

I'm unsure why you are seemingly trying to re-invent the wheel here... (?). Limma/Voom will employ quantile normalisation on your raw counts, if you choose to use that.

I would not be confident of your data after passing it through batch correction, logging, and then quantile normalisation and TPM.

Why not keep this simple and feed your raw counts into Limma/Voom, edgeR, or DESeq2? I also recommend avoiding the use of ComBat at all costs. If you use DESeq2, include batch as a covariate in your design model.
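As a rough illustration of what "batch as a covariate" buys you, here is a sketch using an ordinary least-squares fit on made-up log-scale values. This is an analogy only: DESeq2 itself fits a negative binomial GLM to counts, and `solve_linear`, `fit_ols` and the toy data are hypothetical, not part of any package.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear(xtx, xty)


# Hypothetical unbalanced layout: 6 samples, one gene, log-scale expression.
condition = [0, 0, 0, 1, 1, 1]
batch = [0, 0, 1, 0, 1, 1]
# Simulated without noise: baseline 5, condition effect 2, batch effect 3.
y = [5 + 2 * c + 3 * b for c, b in zip(condition, batch)]

# Naive comparison that ignores batch.
treated = [v for v, c in zip(y, condition) if c == 1]
control = [v for v, c in zip(y, condition) if c == 0]
naive_effect = sum(treated) / len(treated) - sum(control) / len(control)

# Design matrix with columns: intercept, condition, batch.
design = [[1.0, float(c), float(b)] for c, b in zip(condition, batch)]
intercept, condition_effect, batch_effect = fit_ols(design, y)

print("naive condition effect:", naive_effect)
print("condition effect with batch in the model:", round(condition_effect, 6))
```

Because the batches are unevenly split between conditions here, the naive difference of means absorbs part of the batch effect, while the model with a batch column recovers the simulated condition effect without altering the data themselves.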

You may want to take a look at this publication to understand the normalisation methods: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

That's what I ended up doing. My main issue was with between-sample normalization of TPM values, say when you don't have easy access to raw count data. Presumably the way I worded my post didn't get this across as well as it should have.

ADD REPLY • link 6.8 years ago by mforde84 ★ 1.4k

I see, yes, the TCGA raw data (BAM files) is controlled / Level 1 access; I presume that is what you were alluding to. I was able to download raw HTSeq RNA-seq counts recently, though (Level 3 / open access). I believe that TPM is part of the RSEM RNASeqV2 data.

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

Unfortunately it's not TCGA data; it's a proprietary dataset. Hypothetically, say I wasn't able to get raw counts: how would you suggest doing between-subject normalization for TPM? TCGA does upper-quartile normalization on its FPKM and TPM data, but that didn't seem to do enough for my dataset, judging by the sample-wise boxplots. Or maybe I'm just being too nitpicky. I've seen others suggest simply using offset log2 TPM? Maybe it's not as powerful as count-based approaches, but it seems to pick up the major players in the dataset.
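For reference, the two transformations mentioned above could be sketched as follows. This is a minimal illustration, not TCGA's actual pipeline: the function names and toy TPM values are hypothetical, the upper quartile is taken over nonzero values (an assumption), and samples are rescaled to the mean upper quartile rather than to any fixed constant.

```python
import math
import statistics


def upper_quartile(values):
    """75th percentile of the nonzero values in one sample."""
    nonzero = [v for v in values if v > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_scale(samples):
    """Rescale each sample so all samples share the same upper quartile."""
    quartiles = [upper_quartile(s) for s in samples]
    target = sum(quartiles) / len(quartiles)
    return [[v * target / q for v in s] for s, q in zip(samples, quartiles)]


def log2_offset(values, offset=1.0):
    """Offset log transform, log2(value + offset)."""
    return [math.log2(v + offset) for v in values]


# Hypothetical TPM-like values: sample B is sample A doubled.
sample_a = [0.0, 5.0, 10.0, 20.0, 40.0]
sample_b = [0.0, 10.0, 20.0, 40.0, 80.0]

scaled = upper_quartile_scale([sample_a, sample_b])
logged = [log2_offset(s) for s in scaled]

print([round(upper_quartile(s), 3) for s in scaled])
print([[round(v, 3) for v in s] for s in logged])
```

Upper-quartile scaling only applies one multiplicative factor per sample, so it can align the boxplots at one point of the distribution without making the rest of the distribution match, which may be why it looked insufficient on the boxplots.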

ADD REPLY • link 6.8 years ago by mforde84 ★ 1.4k

As far as TPM is concerned, it is not ideal, at least according to these:

RPKM, FPKM, and TPM normalize away the most important factor for comparing samples, which is sequencing depth, whether directly or by accounting for the number of transcripts, which can differ significantly between samples. These approaches rely on normalizing methods that are based on total or effective counts, and tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features can skew the count distribution
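The skew described in that quote can be reproduced with a toy example. This is only a sketch with made-up counts and equal gene lengths; `tpm` is a hypothetical helper implementing the usual length-scaled, per-million definition.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample (gene lengths in kb assumed known)."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Hypothetical data: genes 0-2 have identical counts in both samples,
# while gene 3 is ten times higher in sample B.
lengths_kb = [1.0, 1.0, 1.0, 1.0]
sample_a = [100, 100, 100, 100]
sample_b = [100, 100, 100, 1000]

tpm_a = tpm(sample_a, lengths_kb)
tpm_b = tpm(sample_b, lengths_kb)

for gene, (a, b) in enumerate(zip(tpm_a, tpm_b)):
    print(f"gene {gene}: TPM A = {a:.0f}, TPM B = {b:.0f}")
```

Genes 0 to 2 have the same counts in both samples, yet their TPM values are much lower in sample B, because every sample is forced to sum to one million and the single highly expressed gene takes up most of that total.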

There are also other Biostars threads, but where TPM is concerned, there never appears to be a conclusive answer:

Also, if you want to start/end your day badly, take a look at the response of the senior Bioconductor member here: https://support.bioconductor.org/p/98820/

In my opinion, there is no good way to do a DE analysis of RNA-seq data starting from the TPM values.

ADD REPLY • link 18 months ago by Kevin Blighe 88k

Did you get this sorted out?

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

Sorry for the delay. Yes, I ended up getting things to work. It's just an interesting question regarding how to use TPM/FPKM properly. I'm not sure if there are any benchmarking studies that look at this against the SEQC standards. If there aren't, it would probably make a good paper. I mean, what's the point of using TPM/FPKM if they aren't the most reliable indicators of between-sample differences in expression?

ADD REPLY • link 6.8 years ago by mforde84 ★ 1.4k

Indeed, I'm now following up on threads and posting this:

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

ADD REPLY • link 5.9 years ago by Kevin Blighe 88k