Question: TCGA FPKM-UQ method theory
2.7 years ago by
zx12as342010
zx12as342010 wrote:

Hello everyone. I have recently started studying TCGA data. I can't see how the FPKM_UQ calculation formula relates to its description on the TCGA website.

They say "The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the total protein-coding read count is replaced by the 75th percentile read count value for the sample." https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
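For readers who want the quoted description as arithmetic, here is a minimal sketch. It assumes the usual FPKM definition (reads for the gene, times 10^9, divided by the product of the sample-wide denominator and the gene length in base pairs) and simply swaps the denominator as the quote describes. The function and parameter names are made up for illustration, and exactly which genes enter the 75th percentile is not specified in the quote, so check the GDC page for the details.

```python
def fpkm(gene_count, gene_length_bp, total_protein_coding_count):
    """Plain FPKM: scale by the total protein-coding read count of the sample."""
    return gene_count * 1e9 / (total_protein_coding_count * gene_length_bp)


def fpkm_uq(gene_count, gene_length_bp, upper_quartile_count):
    """FPKM-UQ as described in the quote: same formula, but the total
    protein-coding read count is replaced by the sample's 75th percentile
    gene read count (computed elsewhere and passed in here)."""
    return gene_count * 1e9 / (upper_quartile_count * gene_length_bp)


# Hypothetical numbers, for illustration only.
example_fpkm = fpkm(gene_count=1000, gene_length_bp=2000,
                    total_protein_coding_count=20_000_000)
example_fpkm_uq = fpkm_uq(gene_count=1000, gene_length_bp=2000,
                          upper_quartile_count=500)
print(example_fpkm, example_fpkm_uq)
```

Because the 75th percentile count is far smaller than the total count, FPKM-UQ values come out much larger than FPKM values for the same gene; only the per-sample scaling constant differs.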

Could someone more experienced point me to a reference for this method, or explain how that idea relates to the formula?

Thank you.

rna-seq • 2.1k views
ADD COMMENTlink modified 2.7 years ago by i.sudbery10k • written 2.7 years ago by zx12as342010
1

Here is the page from GDC for FPKM-UQ.

ADD REPLYlink written 2.7 years ago by GenoMax95k

Thank you for the response. My problem is the biological concept behind the formula. Why is "the total protein-coding read count ... replaced by the 75th percentile read count value for the sample"?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by zx12as342010

My understanding is that, from a statistical point of view, normalizing the sample by the 75th percentile read count value is less affected by outliers.

ADD REPLYlink written 2.7 years ago by shoujun.gu370

But that is just a statistical definition. Why is it related to protein?

ADD REPLYlink written 2.7 years ago by zx12as342010

They introduced that part (the 'UQ' part) in order to "facilitate cross-sample comparison and differential expression analysis"

[source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification]

ADD REPLYlink modified 24 months ago • written 2.7 years ago by Kevin Blighe69k

I know that many lectures say FPKM/RPKM are not good. My idea is that FPKM/RPKM do not act as a proper normalization for each sample. Is that right?

ADD REPLYlink written 2.7 years ago by zx12as342010

If you are just looking at a single sample, i.e., n=1, use of RPKM/FPKM units is generally fine. When you have n>1, the problem is that the normalisation method that produces RPKM/FPKM will normalise each sample differently, and the main parameter that affects this is the depth of coverage at which each sample was sequenced. So, a RPKM/FPKM expression value of 200 in one sample is not equivalent to 200 in another sample.

In theory, we can sequence 2 samples to the same target depth of coverage to overcome this; however, in practice, biases always exist and they will be sequenced at different depths.

Hope that this makes sense.

ADD REPLYlink modified 24 months ago • written 2.7 years ago by Kevin Blighe69k

My opinion is the same as yours. I have used RMA, after FPKM or others. I think I must first adjust for sequencing issues, then adjust for batch effects across samples. Right? Thank you.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by zx12as342010
1

If you follow what Ian (i.sudbery) is saying and decide to use DESeq2, then you can adjust for batch effects by just including the batch variable in the design formula.

I also used RMA in the past for microarrays. It has taken some time for adequate normalisation methods for RNA-seq to be developed. DESeq2 and edgeR are very popular, though.

Take a look here in order to get started with DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#can-i-use-deseq2-to-analyze-paired-samples (batch is mentioned at the beginning, under Quick start)

ADD REPLYlink modified 24 months ago • written 2.7 years ago by Kevin Blighe69k
2.7 years ago by
i.sudbery10k
Sheffield, UK
i.sudbery10k wrote:

The reason for the upper quartile normalisation is because of proportionality and sequencing real estate issues.

Consider two samples with the following number of transcripts per gene:

         |   A |   B |
  gene 1 |  10 |  10 |
  gene 2 |  10 |  10 |
  gene 3 |  10 |  10 |
  gene 4 |  70 | 170 |

Now if we take 1 million reads from each sample we'll get the following read counts:

         |     A |    B |
  gene 1 |  100k |  50k |
  gene 2 |  100k |  50k |
  gene 3 |  100k |  50k |
  gene 4 |  700k | 850k |

That is, the increase in expression of the highly expressed gene 4 has sucked sequencing real estate away from genes 1-3, even though they haven't actually changed in expression. This is not a freak accident: gene expression levels tend to be log-normally distributed, so the top few genes take up a large fraction of the reads in any experiment, and even a small change in their expression can have major effects on the reads left for other genes. By excluding the most highly expressed genes when we calculate our normalisation factors, we partially avoid this effect.
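The toy example above can be checked in a few lines. This is only a sketch: the nearest-rank percentile helper is my own choice so that the 75th percentile lands on an actual gene in a four-gene sample, and it is not necessarily how GDC computes it.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Transcripts per gene (genes 1-4) in the two samples from the tables above.
transcripts = {"A": [10, 10, 10, 70], "B": [10, 10, 10, 170]}
depth = 1_000_000  # reads sequenced per sample

results = {}
for sample, per_gene in transcripts.items():
    # Reads are split between genes in proportion to transcript abundance.
    counts = [depth * t / sum(per_gene) for t in per_gene]
    # Total-count scaling: each gene as a fraction of all reads.
    by_total = [c / sum(counts) for c in counts]
    # Upper-quartile scaling: each gene relative to the 75th percentile count.
    uq = nearest_rank_percentile(counts, 75)
    by_uq = [c / uq for c in counts]
    results[sample] = {"counts": counts, "by_total": by_total, "by_uq": by_uq}

for sample, r in results.items():
    print(sample, r["by_total"], r["by_uq"])
```

With total-count scaling, genes 1-3 look halved in B (0.05 versus 0.10). With upper-quartile scaling they come out equal in both samples, and gene 4 keeps its true 70:170 ratio.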

This argument is possibly best laid out in Robinson et al, although they propose a different solution to upper quartile normalisation, one that only works in a differential expression context. Anders et al also go through it, again with their own conclusion on the best normalisation method. As far as I can tell the first reference for UQ normalisation in RNAseq is Bullard et al.

ADD COMMENTlink written 2.7 years ago by i.sudbery10k

Thank you for the response. The references are very helpful. What would you suggest for normalization of RNA-seq data?

ADD REPLYlink written 2.7 years ago by zx12as342010

Depends on what you want to use it for. If your main object of study is how each gene varies between samples, then I would use either DESeq or edgeR normalised read counts. I'm pretty sure that read counts for TCGA are available somewhere. You would then model expression levels as a negative binomial. If you need something more homoskedastic, e.g. for visualisation or clustering, then I'd use rlog transformed counts (see the rlog function in DESeq2, which also performs normalisation).

If you want to compare two genes within a sample then I would probably use TPM (transcripts per million). You could argue that you want to upper quartile normalise this (as TCGA did with their FPKMs), but if your comparisons are purely within sample, then it won't make any difference.
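A small sketch of why it makes no difference within a sample: TPM divides each gene's length-normalised count by a per-sample constant, and any further per-sample rescaling (such as dividing by an upper quartile) cancels out when you take the ratio of two genes in the same sample. The counts, lengths and rescaling constant below are invented for illustration.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise each gene, then scale so the sample sums to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]


# Hypothetical read counts and gene lengths for one sample.
counts = [100, 400, 500]
lengths_bp = [1000, 2000, 500]

tpm_values = tpm(counts, lengths_bp)

# Any additional per-sample constant, e.g. an upper-quartile factor (123.0 is arbitrary).
rescaled = [v / 123.0 for v in tpm_values]

ratio_tpm = tpm_values[1] / tpm_values[0]
ratio_rescaled = rescaled[1] / rescaled[0]
print(ratio_tpm, ratio_rescaled)  # the within-sample ratio is unchanged by the rescaling
```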

ADD REPLYlink written 2.7 years ago by i.sudbery10k

I want to compare genes across different samples. I think RNA-seq needs normalization, as in microarray analysis, and batch effects have to be adjusted for. I think FPKM, RPKM, TPM and FPKM_UQ values can't be compared between samples. I am not sure why GDC does not provide other analysis methods. What do you recommend? Thank you.

ADD REPLYlink written 2.7 years ago by zx12as342010

The rlog function from DESeq2 will allow correction for batch effects, I think; it takes read counts. And limma has a "removeBatchEffect" function, which would take your FPKM_UQ numbers, although it's not ideal.

ADD REPLYlink written 2.7 years ago by i.sudbery10k

For cross-sample comparisons, as Ian implies, FPKM, FPKM-UQ, TPM, etc., are not suitable.

For batch effect adjustment in DESeq2, just include the batch variable in your design model formula, as I mention in my other comment above.

rlog, which Ian mentioned, is a way to get your normalised data into a distribution more amenable to most downstream methods (e.g. boxplots, PCA, heatmaps, clustering, etc.). The statistical comparisons in DESeq2 are performed on the negative binomial normalised counts, though, via the Wald Test.

ADD REPLYlink modified 24 months ago • written 2.7 years ago by Kevin Blighe69k

You can only include batch in your design model if you are going to do differential expression.

ADD REPLYlink written 2.7 years ago by i.sudbery10k

Yes, including batch in the design model will not actually adjust the raw/normalised counts; however, it will include batch in the negative binomial GLM that is fit to these counts, with statistical inferences adjusted accordingly.
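To illustrate the idea of putting batch in the model rather than changing the data, here is a toy sketch. It deliberately swaps in ordinary least squares on made-up, noise-free log-scale values instead of the negative binomial GLM that DESeq2 fits, and every number and name in it is invented. The point is only that the input values are never altered; the condition effect is estimated with and without a batch term in the design.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Eight hypothetical samples; condition and batch are partly confounded.
condition = [0, 0, 0, 1, 0, 1, 1, 1]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
# Simulated log-expression: baseline 5, true condition effect 2, batch effect 3.
y = [5 + 2 * c + 3 * b for c, b in zip(condition, batch)]

with_batch = ols([[1, c, b] for c, b in zip(condition, batch)], y)
without_batch = ols([[1, c] for c in condition], y)

print("condition effect, batch in design:", with_batch[1])
print("condition effect, batch ignored:  ", without_batch[1])
```

In this toy, the estimate with batch in the design recovers the simulated effect of 2, while ignoring batch gives 3.5 because part of the batch shift is attributed to condition. The values in `y` are the same in both fits.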

ADD REPLYlink written 2.7 years ago by Kevin Blighe69k

I have found RSEM and TMM. How can I assess which method adjusts for batch better? The boxplots look similar.

ADD REPLYlink written 2.7 years ago by zx12as342010