Question: TCGA FPKM-UQ method theory
5 months ago by
zx12as342010
zx12as342010 wrote:

Hello everyone. I have recently started studying TCGA data, and I don't understand how the FPKM-UQ formula relates to its description on the GDC website.

They say "The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample." https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
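One way to read the quoted description, as a minimal sketch: standard FPKM divides a gene's read count by the gene length and by the total protein-coding read count, and FPKM-UQ swaps that total for the sample's 75th percentile read count. The function names, the toy counts and lengths, the nearest-rank percentile, and the shared 10^9 scaling constant are all assumptions for illustration; the GDC page linked above is the authority on the exact formula.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def fpkm(counts, lengths):
    """Fragments per kilobase per million: divide by gene length (bp) and total read count."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths[g]) for g in counts}

def fpkm_uq(counts, lengths):
    """Same shape as fpkm(), but the total is replaced by the 75th percentile count (assumed scaling)."""
    uq = percentile_nearest_rank(list(counts.values()), 75)
    return {g: counts[g] * 1e9 / (uq * lengths[g]) for g in counts}

# Hypothetical toy sample: read counts and gene lengths in base pairs.
counts = {"geneA": 500, "geneB": 1000, "geneC": 2000, "geneD": 6500}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000, "geneD": 5000}

print(fpkm(counts, lengths))
print(fpkm_uq(counts, lengths))
```

The only difference between the two functions is the denominator, so the question really comes down to why the 75th percentile is a better per-sample scaling factor than the total.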

Could someone point me to a reference for this method, or explain the reasoning that connects the idea to the formula?

Thank you.

modified 5 months ago by i.sudbery2.6k • written 5 months ago by zx12as342010

Here is the page from GDC for FPKM-UQ.

Thank you for your response. My problem is with the biological concept behind the formula. Why is "the total protein-coding read count ... replaced by the 75th percentile read count value for the sample"?

My understanding is that, from a statistical point of view, normalising a sample by its 75th percentile read count makes the result less sensitive to outliers.

But that is just a statistical definition. Why is it related to protein-coding genes?

They introduced that part (the 'UQ' part) in order to "facilitate cross-sample comparison and differential expression analysis"

Although they state this, in practice, FPKM and FPKM-UQ are inadequate for conducting robust differential expression comparisons. You should aim to obtain raw counts via Kallisto, Salmon, RSEM, or HTSeq and then re-process these in a more robust DE analysis tool, like edgeR, DESeq2, EBSeq, etc.

Also read this: C: the problem with rpkm (and tpm)

I know that many sources show that FPKM/RPKM are not good. My understanding is that FPKM/RPKM do not really work as a normalisation for each sample. Is that right?

The normalisation method is 'okay' if you are just looking at a single sample, i.e., n=1, but some people even doubt it as a normalisation method in that situation. When you have n>1, i.e., more than 1 sample in your study, the problem is that the normalisation method will normalise each sample in a different way, and the main parameter that affects this is the target depth of coverage at which each sample was sequenced. So, an RPKM/FPKM expression value of 200 in one sample is not equivalent to 200 in another sample because they are normalised differently.

In theory, we can sequence 2 samples to the same target depth of coverage; however, in practice, biases always exist and they will be sequenced at different depths.

Hope that this makes sense.

Unfortunately, we now have the situation where a lot of data is out there as FPKM, including TCGA data. Many people with non-statistical / bioinformatic backgrounds use this data without realising the pitfalls of using it.

I agree with you. I have used RMA before, and then FPKM and other methods. I think I first need to correct for sequencing differences and then adjust for batch effects between samples. Is that right? Thank you.


If you follow what Ian (i.sudbery) is saying and decide to use DESeq2, then you can adjust for batch effects by just including the batch variable in the design formula.

I also used RMA in the past for microarrays. It has taken some time for adequate normalisation methods for RNA-seq to be developed. DESeq2 and edgeR are very popular, though.

Take a look here in order to get started with DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples (batch is mentioned at the beginning, under Quick start)

I've also posted some other stuff about DESeq2's normalisation strategies:

5 months ago by
i.sudbery2.6k
Sheffield, UK
i.sudbery2.6k wrote:

Upper quartile normalisation exists because of proportionality and sequencing real estate issues.

Consider two samples with the following number of transcripts per gene:

```
       |   A |   B |
gene 1 |  10 |  10 |
gene 2 |  10 |  10 |
gene 3 |  10 |  10 |
gene 4 |  70 | 170 |
```

Now if we take 1 million reads from each sample we'll get the following read counts:

```
       |     A |    B |
gene 1 |  100k |  50k |
gene 2 |  100k |  50k |
gene 3 |  100k |  50k |
gene 4 |  700k | 850k |
```

That is, the increase in expression of the highly expressed gene 4 has sucked sequencing real estate away from genes 1-3, even though their expression hasn't actually changed. This is not a freak accident: gene expression levels tend to be log-normally distributed, so the top few genes will take up a large fraction of the reads in any experiment, and even a small change in their expression can have a major effect on the reads left for other genes. By excluding the most highly expressed genes when we calculate our normalisation factors, we partially avoid this effect.
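The toy example above can be checked with a short sketch that scales each sample's read counts either by the total count or by the upper quartile. The helper names and the nearest-rank definition of the 75th percentile are assumptions made for illustration; with only four genes, the upper quartile happens to land on one of the unchanged genes, whereas real data has thousands of genes.

```python
import math

def upper_quartile(values):
    """Nearest-rank 75th percentile (an assumed definition, chosen for this tiny example)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.75 * len(ordered)))
    return ordered[rank - 1]

def scale(sample, factor):
    """Divide every gene's read count in a sample by a single per-sample factor."""
    return {gene: count / factor for gene, count in sample.items()}

# Read counts from the table above (1 million reads per sample).
reads = {
    "A": {"gene 1": 100_000, "gene 2": 100_000, "gene 3": 100_000, "gene 4": 700_000},
    "B": {"gene 1": 50_000, "gene 2": 50_000, "gene 3": 50_000, "gene 4": 850_000},
}

total_scaled = {name: scale(s, sum(s.values())) for name, s in reads.items()}
uq_scaled = {name: scale(s, upper_quartile(s.values())) for name, s in reads.items()}

for gene in reads["A"]:
    print(gene,
          "total:", total_scaled["A"][gene], total_scaled["B"][gene],
          "UQ:", uq_scaled["A"][gene], uq_scaled["B"][gene])
```

With total-count scaling, gene 1 appears to halve between A and B (0.1 vs 0.05) even though its transcript number is unchanged. With upper quartile scaling, genes 1-3 come out equal in both samples and gene 4 is 7 vs 17, which matches the true 70 vs 170 ratio.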

This argument is possibly best laid out in Robinson et al, although they propose a different solution to upper quartile normalisation, one that only works in a differential expression context. Anders et al also go through it, again with their own conclusion on the best normalisation method. As far as I can tell the first reference for UQ normalisation in RNAseq is Bullard et al.

Thank you for your response. The references are very helpful. What normalisation would you suggest for RNA-seq?

Depends on what you want to use it for. If your main object of study is looking at how each gene varies between samples, then I would use either DESeq or edgeR normalised read counts. I'm pretty sure that read counts for TCGA are available somewhere. You would then model expression levels as a negative binomial. If you need something more homoskedastic, like for visualising clustering etc, then I'd use rlog transformed counts (see the rlog function in DESeq2, it also performs normalisation).

If you want to compare two genes within a sample then I would probably use TPM (transcripts per million). You could argue that you want to upper quartile normalise this (as TCGA did with their FPKMs), but if your comparisons are purely within sample, then it won't make any difference.
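As a rough sketch of the within-sample point: TPM divides each count by gene length and then rescales so the sample sums to one million, and dividing a whole sample by any further constant (such as its upper quartile) leaves the ratio between two genes untouched. The counts, lengths, and the constant `123.0` below are hypothetical values chosen only for illustration.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalise counts, then rescale to sum to 1e6."""
    rates = {g: counts[g] / lengths_kb[g] for g in counts}
    total_rate = sum(rates.values())
    return {g: rates[g] / total_rate * 1e6 for g in rates}

# Hypothetical single sample: read counts and gene lengths in kilobases.
counts = {"gene 1": 100, "gene 2": 400, "gene 3": 500}
lengths_kb = {"gene 1": 1.0, "gene 2": 2.0, "gene 3": 5.0}

values = tpm(counts, lengths_kb)
ratio_before = values["gene 2"] / values["gene 1"]

# Stand-in for an upper quartile factor: one constant applied to the whole sample.
rescaled = {g: v / 123.0 for g, v in values.items()}
ratio_after = rescaled["gene 2"] / rescaled["gene 1"]

print(ratio_before, ratio_after)
```

Because the extra factor is shared by every gene in the sample, it cancels in any within-sample ratio; it only matters once values from different samples are set side by side.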

I want to compare genes across different samples. I think RNA-seq data needs normalisation, as in microarray analysis, along with batch effect adjustment. As I understand it, FPKM, RPKM, TPM and FPKM-UQ values can't be compared across samples. I am not sure why GDC does not provide data processed with other methods. What do you recommend? Thank you.

The rlog function from DESeq will allow correction for batch effects, I think; it takes read counts. And limma has a `removeBatchEffect` function that would take your FPKM-UQ numbers, although it's not ideal.

For cross-sample comparisons, as Ian implies, FPKM, FPKM-UQ, TPM, etc., are not suitable. I don't know why the TCGA made the data available in that format. A big question to the NIH.

For batch effect adjustment in DESeq2, just include the `batch` variable in your design model formula, as I mention in my other comment above.

rlog, which Ian mentioned, is a way to get your data onto a log2 scale with [roughly] constant variance across the range of expression, which is useful for plots and analyses like boxplots, PCA, heatmaps, clustering, etc. The statistical comparisons in DESeq2 are performed on the normalised counts under a negative binomial model, though, via the Wald test.

You can only include batch in your design model if you are going to do differential expression.

Yes, including batch in the design model will not actually adjust the raw/normalised counts; however, it will include batch in the negative binomial GLM that is fit to these counts, with statistical inferences adjusted accordingly.

I have found RSEM and TMM. How can I assess which method adjusts for batch better? The boxplots look similar.

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

# 1

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

# 2

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.