Question: hclust heatmap clustering produce different relationships with fpkm versus rlog transformed coun
0
gravatar for nancy
11 weeks ago by
nancy60
nancy60 wrote:

I have a data set of n=3 treated vs control RNA-seq. I used hisat2 for alignment followed by stringtie to compute fpkms, and then prepDE.py to convert the fpkms to count data, for analysis for D.E. using DESEQ2.

When I plot dendograms and heatmaps using hclust(ward), the results are different depending upon whether the input is rlog counts or log2fpkm.

https://ibb.co/mqGAFy

I was wondering if anyone could help me interpret this- especially since 1 dendogram reflects the control vs treated grouping better. Thank you.

ADD COMMENTlink modified 11 weeks ago by Kevin Blighe25k • written 11 weeks ago by nancy60

An update (12th August 2018):

You should abandon RPKM / FPKM normalisation. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis: Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Note - FPKM is essentially the same as RPKM

ADD REPLYlink written 5 days ago by Kevin Blighe25k
2
gravatar for Kevin Blighe
11 weeks ago by
Kevin Blighe25k
USA / Europe / Brazil
Kevin Blighe25k wrote:

Dear nancy,

It makes perfect sense that the dendrograms would be different for the following 2 main reasons:

  1. FPKM and DESeq2's normalisation methods are different: FPKM does not sufficiently adjust for differences in library size across your samples; DESeq2 does [account for difference]
  2. log (base 2) is not the same transformation as the regularisd log transformation of DESeq2

In addition, looking at your dendrograms, I do not see good separation between control and treated. For your FPKM data, I see a 'flat' and 'structureless' dendrogram, which is what I'd expect from logged FPKM. On the other hand, I see a well-structured dendrogram for your rlog data, which is what I expect in knowing this data transformation method.

Something to keep in mind: when clustering, Euclidean distance should only be used for binomially-distributed data. If you want to cluster FPKM data, you should probably at least use 1 minus Pearson/Spearman correlation distance.

Kevin

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Kevin Blighe25k

Hi Kevin,

Thank you very much for your helpful response. Firstly I need to apologize, turns out the previous hclust dendograms were not generated using the exact same set of data.

When repeated with the correct files, they still look different.

https://ibb.co/denjAy

now I understand from your answer that the inherent normalizations for fpkm and counts may contribute to it. Indeed, the samples dont separate well to start with so that is perhaps the biggest reason that the separations always look noisy, when comparing data for >15000 data points.

  1. I always thought (going by the name fpkm), it was well normalized for # of reads etc.
  2. I was unsure whether when we use prepDE to convert fpkm to counts, the same normalization as running default DESeq2 applies
  3. May I ask, since rlog transformed count data from DESeq2 should reflect a neg binomial distribution, ward is OK to use as correlation distance?

Thanks alot nancy

ADD REPLYlink written 11 weeks ago by nancy60

Hey nancy,

Thanks for the updates. So, it's the correlation distance that contributes to the 'flat' dendrograms. As I think about it, it makes sense because using correlation distance on a small number of samples will not give much reliable values, when you think about it. How can I faithfully correlate 6 samples against each other? There will not be much variation in the correlation distance values, and therefore not much variation in the dendrogram structure.

In saying that, it makes sense that correlation distance produces the same dendrogram between log2FPKM and the rlog counts because correlation, the statistical measure, is 'mostly' independent of the library size difference that I mentioned earlier (if you think about it, correlation just looks for what increases and decreases together). This is why FPKM is fine for things like co-expression analysis, as in network analysis (think of WGCNA), but not differential expression analysis.

I don't know much about the prepDE function of StringTie, so, I don't know how it re-produces counts suitable for DESeq2. My preference is always to get raw counts and use those for DESeq2. If I need to use HISAT2 / StringTie, it's only to generate a custom GTF file, which I then re-use on my FASTQs or BAMs for the purpose of deriving raw counts suitable for DESeq2. That said, using prepDE for DESeq2 should be fine - I believe that DESeq2 may even have a specific function for reading in and transforming FPKM data, but you can check in the vignette (vignette("DESeq2")).

The rlog data will actually mostly reflect a binomial distribution. However, it is produced from the normalised counts, which are measured on a negative binomial distribution. In this sense, Euclidean distance and Ward's linkage are fine for rlog data. For log2FPKM in clustering, perhaps Euclidean distance and correlation distance would be more reliable (as Spearman).

PS - make sure that you use ward.D2 and not ward or ward.D in the hclust() function. There's an issue with ward:

Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014).

[source: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html]

ADD REPLYlink written 11 weeks ago by Kevin Blighe25k

PS - why do you have 15,000 data-points? Are you plotting all transcripts that passed QC? Did you do any filtering for differentially expressed transcripts? If you plot the entire dataset in clustering, you will rarely see 'natural' segregation between your groups.

ADD REPLYlink written 11 weeks ago by Kevin Blighe25k

Hi Kevin,

"In saying that, it makes sense that correlation distance produces the same dendrogram between log2FPKM and the rlog counts because correlation, the statistical measure, is 'mostly' independent of the library size difference that I mentioned earlier (if you think about it, correlation just looks for what increases and decreases together). This is why FPKM is fine for things like co-expression analysis, as in network analysis (think of WGCNA), but not differential expression analysis."

1.I guess the dendograms are not 100% identical between fpkm and counts when Ward is used, but identical when spearman is used. I am still surpised why "fpkms" are not considered well normalized for library size. I thought that was the whole point of fragments per kilo million. I'd be grateful if you could share more on that.

  1. Thanks for letting me know about the issues with Ward. We will make the switch to WardD2.

  2. Yes, 15,000 data points because plotting all genes. The unbiased clustering is just to see how the data plays out to begin with. Indeed, there's more similarity across samples than the smaller changes between groups, so we don't see natural segregation. When I filter for DE genes, the heatmaps and clustering look very pretty as one would expect.

Thanks

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by nancy60

Hi nancy, FPKM was an earlier form of normalisation for RNA-seq data when sequencing was only being performed on 1 or a few samples. It is now considered inappropriate for differential expression comparisons across multiple samples. Here is the primary reason why:

in NGS, samples are always sequenced to different depths of coverage, even if they are sequence to a specific target depth of coverage. Factors within the sequencer itself and differences in sample prep. mean that one sample will always be sequenced to a higher or lower depth than another. What this translates to is more or less reads aligning to one sample over another when we conduct our analyses. This is critical in RNA-seq because we gauge expression of a particular gene based on the number of reads aligning.

FPKM does not adequately adjust for this fact that I've mentioned above, and neither does it's 'sister' methods, RPKM and FPKM-UQ. It means that an expression value of, for example, 1000 in one sample is not equivalent to 1000 in another sample, due to the fact that they would have been sequenced to different depths of coverage. Thus, when conducting differential expression analysis, the samples remain incomparable (because statistical tests look at the mean or median, and other factors influence these tests, too). One major issue with all of this is that large consortia, such as the TCGA, have released vast amounts of data in FPKM / FPKM-UQ to the community, apparently not caring too much about progressing bad science.

Another issue in RNA-seq is that larger genes naturally accumulate more reads, which makes sense; however, in these situations, it does not imply that the gene is more expressed than, say, a smaller ncRNA. FPKM does [yes] adjust for this fact.

Methods that do take into account both of the phenomena that I mention above include TMM (implemented in EdgeR) and DESeq2's 'media ratio' method.

Hope that this helps. FPKM, despite receiving its criticism, is still widely used, but it should not be when the goal is to conduct differential expression. If you search online, you will find lots of discussions about it; however, you should be aware that even the 'mastermind' behind FPKM questioned it's utility (in the video):

Here's another interesting few lines from a very well cited published work back in 2013:

the presence of these high-count genes clearly results in an inFated false-positive rate for five of the normaliza- tion methods (TC, UQ, Med, Q and RPKM). Only DESeq and TMM are able to control the false-positive rate while also maintaining the power to detect differentially expressed genes.

.

Based on three real mRNA and one miRNA-seq datasets, we confirm previous observations that RPKM and TC, both of which are still widely in use [40, 41], are ineffective and should be definitively abandoned in the context of differential analysis. The RPKM approach was initially proposed to account for differences in gene length [19]; however, the re- lationship between gene length and DE actually varies among the datasets considered here (Supplementary Figures S6–S8). Even in cases where a strong positive association between gene counts and length is observed, scaling counts by gene length with RPKM is not sufficient for removing this bias [16, 20].

[source: https://www.ncbi.nlm.nih.gov/pubmed/22988256 ]

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kevin Blighe25k

Hi Kevin,

Thank you for your detailed reply. It will take me a while to assimilate and understand in as great detail. Having been on the lab side for several years, managing an NGS core lab - the only thing that strikes to me as odd is this part "NGS, samples are always sequenced to different depths of coverage... Factors within the sequencer itself and differences in sample prep. " - I must say that nowadays, if one is preparing standard bulk RNA-seq libraries, and running standard Illumina platforms, it is quite easy to accurately measure library concentrations and ensure equal output, down to +/- 2-5 million reads for libraries being sequenced in a batch. I agree that if you are comparing samples from different runs/experiments etc. it starts getting a little noisy, and in those cases I will now appreciate the accuracy from counts vs fpkm even if I don't fully understand the math/stat just yet. :-)

On a related note, since you earlier mentioned you were familiar with DESeq2 may I request your comment on another post I made, where I observed that (in a data set of n=2), the p-values for data points showing huge variation in replicates are quite low and comparable to other events where the replicates behave well.

How to filter for samples that show good concordance across replicates in DESEQ2 count data from RNA-SEQ?

This seemed rather counter-intuitive to me, the D.E estimated in the former case is likely to be random and not statistically significant? I would appreciate your thoughts on this. Thank you.

ADD REPLYlink written 10 weeks ago by nancy60

Okay, I have given a reply there. I actually branched from Computer Science into biology wet-lab into bioinformatics. The further I go into my career, I now enter more biostatistics, so, I worry about the numbers no more than I used to!

ADD REPLYlink written 10 weeks ago by Kevin Blighe25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1561 users visited in the last hour