Question: TCGA FPKM-UQ method theory
2.7 years ago by zx12as342010 wrote:

Hello everyone. I have recently started studying TCGA data, but I cannot work out how the FPKM-UQ calculation formula relates to its description on the TCGA website.

They say "The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample." https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
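As that GDC description states, the only change from plain FPKM is the denominator. Here is a minimal Python sketch of the two formulas as I understand them from the GDC page — the gene counts and lengths are made up for illustration, and per the quoted description both denominators should be computed over protein-coding genes only:

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Plain FPKM: scale by the total read count and gene length."""
    return counts * 1e9 / (counts.sum() * lengths_bp)

def fpkm_uq(counts, lengths_bp):
    """FPKM-UQ: the total read count is replaced by the sample's
    75th-percentile gene-level read count."""
    uq = np.percentile(counts, 75)
    return counts * 1e9 / (uq * lengths_bp)

counts = np.array([100, 200, 300, 400, 50000])         # made-up per-gene read counts
lengths_bp = np.array([1000, 2000, 1500, 3000, 2500])  # made-up gene lengths (bp)
print(fpkm(counts, lengths_bp))
print(fpkm_uq(counts, lengths_bp))
```

Note that the huge fifth gene dominates `counts.sum()` but not the 75th percentile, which is the whole point of the substitution.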

Could anyone point me to a reference for this method, or explain how that idea relates to the formula?

Thank you.

rna-seq • 2.1k views
modified 2.7 years ago by i.sudbery10k • written 2.7 years ago by zx12as342010

Here is the page from GDC for FPKM-UQ.

Thank you for the response. My problem is with the biological concept behind the formula. Why is "the total protein-coding read count ... replaced by the 75th percentile read count value for the sample"?

My understanding is that, from a statistical point of view, normalising by the 75th percentile read count value makes the sample less affected by outliers.
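That intuition can be checked numerically. In the made-up simulation below (Python, illustrative only), a single extreme gene shifts the total read count by almost a million reads while leaving the 75th percentile untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
typical = rng.poisson(100, size=1000)      # 1000 "typical" genes
sample_a = np.append(typical, 10_000)      # plus one highly expressed gene
sample_b = np.append(typical, 1_000_000)   # same genes, but the top gene exploded

# The total (plain FPKM's denominator) moves a lot...
print(sample_a.sum(), sample_b.sum())
# ...while the 75th percentile (FPKM-UQ's denominator) does not move at all.
print(np.percentile(sample_a, 75), np.percentile(sample_b, 75))
```

Because the extreme gene is the maximum in both samples, the 75th percentile lands on the same typical gene either way.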

But that is just the statistical definition. Why is it related to protein-coding genes?

They introduced that part (the 'UQ' part) in order to "facilitate cross-sample comparison and differential expression analysis".

I know that many lectures say FPKM/RPKM are not good. My impression is that FPKM/RPKM do not really normalise each sample in a way that allows cross-sample comparison. Is that right?

If you are just looking at a single sample, i.e., n=1, use of RPKM/FPKM units is generally fine. When you have n>1, the problem is that the normalisation method that produces RPKM/FPKM normalises each sample differently, and the main parameter that affects this is the depth of coverage at which each sample was sequenced. So, an RPKM/FPKM expression value of 200 in one sample is not equivalent to 200 in another sample.

In theory, we can sequence 2 samples to the same target depth of coverage to overcome this; however, in practice, biases always exist and they will be sequenced at different depths.

Hope that this makes sense.

My opinion is the same as yours. I have used RMA in the past, and later FPKM and others. I think I must first adjust for sequencing issues, then adjust for batch effects across samples. Is that right? Thank you.


If you follow what Ian (i.sudbery) is saying and decide to use DESeq2, then you can adjust for batch effects by just including the batch variable in the design formula.

I also used RMA in the past for microarrays. It has taken some time for adequate normalisation methods for RNA-seq to be developed. DESeq2 and edgeR are very popular, though.

Take a look here in order to get started with DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples (batch is mentioned at the beginning, under Quick start)

2.7 years ago by i.sudbery10k (Sheffield, UK) wrote:

The reason for the upper quartile normalisation is because of proportionality and sequencing real estate issues.

Consider two samples with the following number of transcripts per gene:

```
       |   A |   B |
gene 1 |  10 |  10 |
gene 2 |  10 |  10 |
gene 3 |  10 |  10 |
gene 4 |  70 | 170 |
```

Now if we take 1 million reads from each sample we'll get the following read counts:

```
       |     A |    B |
gene 1 |  100k |  50k |
gene 2 |  100k |  50k |
gene 3 |  100k |  50k |
gene 4 |  700k | 850k |
```

That is, the increase in expression of the highly expressed gene 4 has sucked sequencing real estate away from genes 1-3, even though their expression hasn't actually changed. This is not a freak accident: gene expression levels tend to be log-normally distributed, so the top few genes take up a large fraction of the reads in any experiment, and even a small change in their expression can have major effects on the reads left for other genes. By excluding the most highly expressed genes when we calculate our normalisation factors, we partially avoid this effect.
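The read counts in the tables above can be reproduced with a short calculation. A Python sketch (illustrative only; with just four genes the median is used below in place of the upper quartile, since the 75th percentile of four values already interpolates into gene 4 — on a real transcriptome you would use the upper quartile):

```python
import numpy as np

# Transcripts per gene in samples A and B, from the first table.
a = np.array([10, 10, 10, 70])
b = np.array([10, 10, 10, 170])

def expected_counts(transcripts, n_reads=1_000_000):
    """Each read is drawn in proportion to a gene's share of all transcripts."""
    return n_reads * transcripts / transcripts.sum()

print(expected_counts(a))  # genes 1-3: 100000 each; gene 4: 700000
print(expected_counts(b))  # genes 1-3:  50000 each; gene 4: 850000

# Normalising by a quantile that excludes gene 4 restores genes 1-3 to equality:
print(expected_counts(a) / np.median(expected_counts(a)))  # genes 1-3 become 1; gene 4 becomes 7
print(expected_counts(b) / np.median(expected_counts(b)))  # genes 1-3 become 1; gene 4 becomes 17
```

After quantile normalisation, genes 1-3 are identical across samples, and only gene 4 shows a change — exactly what the underlying transcript counts say.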

This argument is possibly best laid out in Robinson et al., although they propose a different solution to upper quartile normalisation, one that only works in a differential expression context. Anders et al. also go through it, again with their own conclusion on the best normalisation method. As far as I can tell, the first reference for UQ normalisation in RNA-seq is Bullard et al.

Thank you for the response. The references are helpful for me. What is your suggestion for RNA-seq normalisation?

Depends on what you want to use it for. If your main object of study is how each gene varies between samples, then I would use either DESeq- or edgeR-normalised read counts. I'm pretty sure that read counts for TCGA are available somewhere. You would then model expression levels as a negative binomial. If you need something more homoskedastic, e.g. for visualising clustering, then I'd use rlog-transformed counts (see the rlog function in DESeq2; it also performs normalisation).

If you want to compare two genes within a sample, then I would probably use TPM (transcripts per million). You could argue that you want to upper-quartile-normalise this (as TCGA did with their FPKMs), but if your comparisons are purely within-sample, then it won't make any difference.
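For reference, TPM is essentially the FPKM calculation with the steps reordered: normalise by length first, then rescale so every sample sums to one million, which is what makes within-sample comparisons convenient. A sketch in Python with made-up counts and lengths:

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rate = counts / lengths_kb
    return rate / rate.sum() * 1e6

counts = np.array([500, 1000, 1500])     # made-up per-gene read counts
lengths_kb = np.array([1.0, 2.0, 3.0])   # made-up gene lengths in kilobases
values = tpm(counts, lengths_kb)
print(values)        # all three genes have the same length-normalised rate here
print(values.sum())  # TPMs always sum to one million within a sample
```

The fixed per-sample sum is also why TPM (like FPKM) does not by itself make values comparable across samples: a composition change in one gene reshuffles everyone else's share.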

I want to compare genes across different samples. I think RNA-seq has to be normalised, like in microarray analysis, and batch effects have to be adjusted. I think FPKM, RPKM, TPM, and FPKM-UQ values cannot be compared across samples. I am not sure why GDC does not provide other analysis methods. What is your recommendation? Thank you.

The rlog function from DESeq2 will allow correction for batch effects, I think; it takes read counts. And limma has a `removeBatchEffect` function that would take your FPKM-UQ numbers, although it's not ideal.

For cross-sample comparisons, as Ian implies, FPKM, FPKM-UQ, TPM, etc., are not suitable.

For batch effect adjustment in DESeq2, just include the `batch` variable in your design model formula, as I mention in my other comment above.

rlog, which Ian mentioned, is a way to get your normalised data into a distribution more amenable to most downstream methods (e.g. boxplots, PCA, heatmaps, clustering, etc.). The statistical comparisons in DESeq2 are performed on the negative binomial normalised counts, though, via the Wald Test.

You can only include batch in your design model if you are going to do differential expression.

Yes, including batch in the design model will not actually adjust the raw/normalised counts; however, it will include batch in the negative binomial GLM that is fit to these counts, with statistical inferences adjusted accordingly.

I have found RSEM and TMM. How can I estimate which method adjusts for batch effects better? The boxplots look similar.