Question

TCGA FPKM-UQ method theory

2

6.0 years ago

zx12as3420 ▴ 20

Hello everyone. I have recently started studying TCGA data. I don't understand how the FPKM-UQ calculation formula relates to its description on the TCGA website.

They say "The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample." https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

Could anyone point me to a reference for this method, or explain how that idea relates to the formula?

Thank you.

RNA-Seq • 4.3k views

ADD COMMENT • link updated 6.0 years ago by i.sudbery 19k • written 6.0 years ago by zx12as3420 ▴ 20

1

Here is the page from GDC for FPKM-UQ.
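
For reference, the definition quoted in the question can be sketched in a few lines of Python. This is only a sketch, assuming FPKM-UQ keeps the usual FPKM form (reads times 10^9, divided by gene length times a per-sample read count) and swaps only the per-sample term, as the quoted sentence describes. The function names and example numbers are made up for illustration and are not taken from the GDC pipeline.

```python
def fpkm(gene_reads, gene_length_bp, total_protein_coding_reads):
    """FPKM: scale by the total protein-coding read count of the sample."""
    return gene_reads * 1e9 / (total_protein_coding_reads * gene_length_bp)


def fpkm_uq(gene_reads, gene_length_bp, upper_quartile_reads):
    """FPKM-UQ (assumed form): same as FPKM, but the per-sample term is
    the 75th percentile gene read count instead of the total."""
    return gene_reads * 1e9 / (upper_quartile_reads * gene_length_bp)


# Hypothetical numbers for one gene in one sample.
example_fpkm = fpkm(gene_reads=500, gene_length_bp=2000,
                    total_protein_coding_reads=20_000_000)
example_fpkm_uq = fpkm_uq(gene_reads=500, gene_length_bp=2000,
                          upper_quartile_reads=1000)
print(example_fpkm, example_fpkm_uq)
```

The only difference between the two functions is which per-sample number sits in the denominator, so the two units are on very different numeric scales and should not be mixed.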

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Thank you for the response. My problem is with the biological concept behind the formula. Why is it that "the total protein-coding read count is replaced by the 75th percentile read count value for the sample"?

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

0

My understanding is that, from a statistical point of view, normalizing the sample by the 75th percentile read count value makes it less affected by outliers.

ADD REPLY • link 6.0 years ago by shoujun.gu ▴ 380

0

But that is just a statistical definition. Why is it related to the protein-coding read count?

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

0

They introduced that part (the 'UQ' part) in order to "facilitate cross-sample comparison and differential expression analysis"

[source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification]

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

I know that many lectures say FPKM/RPKM are not good. My understanding is that FPKM/RPKM do not really act as a normalization across samples. Is that right?

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

1

If you are just looking at a single sample, i.e., n=1, use of RPKM/FPKM units is generally fine. When you have n>1, the problem is that the normalisation method that produces RPKM/FPKM will normalise each sample differently, and the main parameter that affects this is the depth of coverage at which each sample was sequenced. So, an RPKM/FPKM expression value of 200 in one sample is not equivalent to 200 in another sample.

In theory, we can sequence 2 samples to the same target depth of coverage to overcome this; however, in practice, biases always exist and they will be sequenced at different depths.

Hope that this makes sense.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

My opinion is the same as yours. I have used RMA, and later FPKM and other methods. I think I must first adjust for sequencing issues and then adjust for batch effects between samples. Is that right? Thank you.

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

1

If you follow what Ian (i.sudbery) is saying and decide to use DESeq2, then you can adjust for batch effects by just including the batch variable in the design formula.

I also used RMA in the past for microarrays. It has taken some time for adequate normalisation methods for RNA-seq to be developed. DESeq2 and edgeR are very popular, though.

Take a look here in order to get started with DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples (batch is mentioned at the beginning, under Quick start)

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

score 3 · Answer 1 · 2018-05-11

3

6.0 years ago

i.sudbery 19k

The reason for upper quartile normalisation is proportionality and sequencing real estate issues.

Consider two samples with the following number of transcripts per gene:

         |   A |   B |
  gene 1 |  10 |  10 |
  gene 2 |  10 |  10 |
  gene 3 |  10 |  10 |
  gene 4 |  70 | 170 |

Now if we take 1 million reads from each sample we'll get the following read counts:

         |     A |    B |
  gene 1 |  100k |  50k |
  gene 2 |  100k |  50k |
  gene 3 |  100k |  50k |
  gene 4 |  700k | 850k |

That is, the increase in expression of the highly expressed gene 4 has sucked sequencing real estate away from genes 1-3, even though they haven't actually increased in expression. This is not a freak accident: gene expression levels tend to be log-normally distributed, so the top few genes will take up a large fraction of the reads in any experiment, and even a small change in their expression could have major effects on the reads left for other genes. By excluding the most highly expressed genes when we calculate our normalisation factors, we partially avoid this effect.
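
The toy example above can be reproduced with a short Python sketch. It assumes a nearest-rank definition of the 75th percentile (with only four genes, an interpolating definition would give a different value), and the helper names are hypothetical. With so few genes the upper quartile happens to land on an unchanged gene; real samples have thousands of genes, so this is an illustration of the idea rather than a realistic calculation.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


DEPTH = 1_000_000  # reads drawn from each sample

# True transcript numbers per gene (genes 1-4), as in the tables above.
transcripts = {"A": [10, 10, 10, 70], "B": [10, 10, 10, 170]}

# Reads are shared out in proportion to transcript abundance.
reads = {s: [DEPTH * t / sum(tx) for t in tx] for s, tx in transcripts.items()}

# Total-count scaling (reads per million): genes 1-3 look halved in B.
per_million = {s: [r * 1e6 / sum(rc) for r in rc] for s, rc in reads.items()}

# Upper-quartile scaling: divide by the 75th percentile read count.
upper_quartile = {s: nearest_rank_percentile(rc, 75) for s, rc in reads.items()}
uq_scaled = {s: [r / upper_quartile[s] for r in rc] for s, rc in reads.items()}

for sample in transcripts:
    print(sample, "per million:", per_million[sample])
    print(sample, "UQ-scaled:  ", uq_scaled[sample])
```

Under total-count scaling genes 1-3 appear to differ between A and B, whereas after dividing by the upper quartile they are equal and only gene 4 differs.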

This argument is possibly best laid out in Robinson et al., although they propose a different solution to upper quartile normalisation, one that only works in a differential expression context. Anders et al. also go through it, again with their own conclusion on the best normalisation method. As far as I can tell the first reference for UQ normalisation in RNA-seq is Bullard et al.

ADD COMMENT • link 6.0 years ago by i.sudbery 19k

0

Thank you for the response. The references are very helpful. What normalization would you suggest for RNA-seq?

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

0

Depends on what you want to use it for. If your main object of study is looking at how each gene varies between samples, then I would use either DESeq or edgeR normalised read counts. I'm pretty sure that read counts for TCGA are available somewhere. You would then model expression levels as a negative binomial. If you need something more homoskedastic, like for visualising clustering etc, then I'd use rlog transformed counts (see the rlog function in DESeq2, it also performs normalisation).

If you want to compare two genes within a sample then I would probably use TPM (transcripts per million). You could argue that you want to upper quartile normalise this (as TCGA did with their FPKMs), but if your comparisons are purely within sample, then it won't make any difference.
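
A minimal Python sketch of the within-sample point: TPM is computed from hypothetical counts and gene lengths, and dividing every gene in the sample by the same factor (an upper quartile or anything else) leaves the ratio between two genes unchanged. The numbers and the name `sample_factor` are made up for illustration.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise, then scale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r * 1e6 / total for r in rates]


# Hypothetical read counts and gene lengths for three genes in one sample.
counts = [200, 1000, 50]
lengths_bp = [1000, 2000, 500]

tpm_values = tpm(counts, lengths_bp)
ratio_tpm = tpm_values[1] / tpm_values[0]

# Any further per-sample scaling divides every gene by the same number.
sample_factor = 3.7  # hypothetical, e.g. an upper-quartile value
rescaled = [v / sample_factor for v in tpm_values]
ratio_rescaled = rescaled[1] / rescaled[0]

print(tpm_values)
print(ratio_tpm, ratio_rescaled)
```

The per-sample factor cancels in any gene-to-gene ratio, so it only matters once values from different samples are compared.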

ADD REPLY • link 6.0 years ago by i.sudbery 19k

0

I want to compare genes across different samples. I think RNA-seq needs normalization, like microarray analysis, and batch effects have to be adjusted for. I don't think FPKM, RPKM, TPM or FPKM-UQ values can be compared across samples. I am not sure why GDC does not provide other analysis methods. What do you recommend? Thank you.

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20

0

I think the rlog function from DESeq2 will allow correction for batch effects; it takes read counts. limma also has a "removeBatchEffect" function, which would take your FPKM-UQ numbers, although it's not ideal.
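
To give a feel for what removing a batch term means, here is a simplified Python sketch for a single gene. It is not limma's implementation: it assumes log-scale expression values, one batch label per sample and no other covariates, in which case subtracting a fitted batch effect reduces to centring each batch on the mean of the batch means. The function name and the data are hypothetical.

```python
from collections import defaultdict


def center_batches(log_values, batches):
    """Shift each batch so that all batch means coincide (single gene)."""
    by_batch = defaultdict(list)
    for value, batch in zip(log_values, batches):
        by_batch[batch].append(value)
    batch_means = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    # Reference level: the mean of the batch means.
    grand_mean = sum(batch_means.values()) / len(batch_means)
    return [v - batch_means[b] + grand_mean
            for v, b in zip(log_values, batches)]


# Hypothetical log2 expression of one gene in four samples, two batches.
log_expr = [5.0, 6.0, 7.0, 8.0]
batch = ["b1", "b1", "b2", "b2"]

adjusted = center_batches(log_expr, batch)
print(adjusted)
```

If the condition of interest is confounded with batch, this kind of centring also removes real signal, which is one reason adjusted values are mostly used for visualisation rather than for testing.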

ADD REPLY • link 6.0 years ago by i.sudbery 19k

0

For cross-sample comparisons, as Ian implies, FPKM, FPKM-UQ, TPM, etc., are not suitable.

For batch effect adjustment in DESeq2, just include the batch variable in your design model formula, as I mention in my other comment above.

rlog, which Ian mentioned, is a way to get your normalised data into a distribution more amenable to most downstream methods (e.g. boxplots, PCA, heatmaps, clustering, etc.). The statistical comparisons in DESeq2 are performed on the negative binomial normalised counts, though, via the Wald Test.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

You can only include batch in your design model if you are going to do differential expression.

ADD REPLY • link 6.0 years ago by i.sudbery 19k

0

Yes, including batch in the design model will not actually adjust the raw/normalised counts; however, it will include batch in the negative binomial GLM that is fit to these counts, with statistical inferences adjusted accordingly.

ADD REPLY • link 6.0 years ago by Kevin Blighe 87k

0

I have found RSEM and TMM. How can I assess which method adjusts for batch effects better? The boxplots look similar.

ADD REPLY • link 6.0 years ago by zx12as3420 ▴ 20