Question: Is there a good way to do a PCA of gene expression including samples from different species?
colin.kern wrote (9 months ago, United States):

I want to do a PCA plot from RNA-seq data that I have from a handful of tissues from five different species. I essentially just want to see that they cluster more by tissue rather than by species, and also for some of the tissues where I don't have an exact match across all the species (i.e. brown adipose in some and white adipose in others) I want to see how close they are compared to other tissues.

Of course I can only use the genes for which there are orthologs in all the species, but I think I will still have an issue with the expression values themselves. I have TPMs from StringTie right now, but because the species have different numbers of genes in their annotations, I don't think the TPMs are comparable between species. Would quantile normalization be a good thing to do? Should I recalculate the TPMs using only the orthologs so there's the same number of genes in each species? Should I do something else entirely?
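To make the "recalculate TPM over orthologs only" option concrete, here is a small sketch; the helper below is hypothetical (not StringTie output) and assumes you have per-species counts and effective lengths keyed by ortholog ID:

```python
# Hypothetical sketch: recompute TPM over the shared ortholog set only,
# so every species is normalized over the same genes. `counts` and
# `lengths` are assumed to be dicts keyed by ortholog ID.
def tpm_from_counts(counts, lengths):
    """TPM_i = 1e6 * (c_i / l_i) / sum_j (c_j / l_j)."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    total = sum(rates.values())
    return {g: 1e6 * r / total for g, r in rates.items()}

# Two toy "species" restricted to shared orthologs A and B:
species1 = tpm_from_counts({"A": 100, "B": 300}, {"A": 1000, "B": 1500})
species2 = tpm_from_counts({"A": 50, "B": 200}, {"A": 900, "B": 1400})
# Each species' TPMs now sum to one million over the same gene set.
```

This at least removes the "different number of annotated genes" problem, since the denominator is computed over an identical gene set in every species.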

Tags: rna-seq
(modified 9 months ago by ATpoint)
kristoffer.vittingseerup wrote (9 months ago, European Union):

I would argue that TPM should be suitable, since the interpretation of a TPM value (according to, among others, this excellent blog) is:

if you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen of type i, given the abundances of the other transcripts in your sample.

Harold then continues:

The last “given” part is important. The denominator is going to be different between experiments, and thus is also sample dependent which is why you cannot directly compare TPM between samples. While this is true, TPM is probably the most stable unit across experiments.

What the last quote refers to is that an inter-library normalization is always needed - a fact described in one of his other blog posts. Meaning that if you make sure to do the inter-library normalization, you should be fine.
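One simple form of such an inter-library normalization (the quantile normalization the question mentions) can be sketched in a few lines of numpy; this is a generic illustration, not tied to any particular package:

```python
import numpy as np

# Quantile normalization sketch: force every sample (column) onto the
# same empirical distribution, here the mean of the sorted columns.
def quantile_normalize(x):
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # per-sample ranks
    reference = np.sort(x, axis=0).mean(axis=1)        # shared distribution
    return reference[ranks]

expr = np.array([[5.0, 4.0],
                 [2.0, 1.0],
                 [3.0, 6.0]])
norm = quantile_normalize(expr)
# Both columns now contain exactly the values 1.5, 3.5, 5.5.
```

After this transformation, between-sample differences in the overall expression distribution are removed and only the ordering of genes within each sample remains.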

Also you can get inspiration from this article (which includes being aware of batch effects).

ATpoint wrote (9 months ago):

I have to disagree with kristoffer.vittingseerup on that matter. TPM, like all methods based entirely on per-million scaling without further correction, suffers when it comes to correcting for changes in library composition. This is already an issue in "normal" RNA-seq, when simply comparing the same cell type of the same species between conditions (such as a certain treatment). It is most likely an even bigger issue when comparing species, where you may have gains or losses of genes, notable expression differences, changes in gene length, etc., so I do expect a notable change in composition. TPM does not account for this; it is good for comparing transcript composition within a single sample.

The catch is that the common normalization techniques - TMM from edgeR, RLE from DESeq2, or transformations such as vst and rlog - all assume that most genes do not change, which I find questionable between species, both with regard to the biological reality and to the completeness of the reference transcriptomes, which might impose further technical difficulties.
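To make the "most genes do not change" assumption concrete, here is a simplified numpy sketch of the median-of-ratios (RLE) idea behind DESeq2's size factors; it illustrates the principle only and is not DESeq2's actual implementation:

```python
import numpy as np

# Median-of-ratios sketch: compare each sample to a per-gene
# pseudo-reference (the geometric mean across samples) and take the
# median ratio. This is robust only while most genes are stable.
def size_factors(counts):                       # genes x samples, raw counts
    log_geo_mean = np.log(counts).mean(axis=1)          # pseudo-reference
    keep = np.isfinite(log_geo_mean)                    # drop genes with zeros
    log_ratios = np.log(counts[keep]) - log_geo_mean[keep, None]
    return np.exp(np.median(log_ratios, axis=0))        # per-sample factor

counts = np.array([[100.0, 200.0],
                   [ 50.0, 100.0],
                   [ 30.0,  60.0],
                   [ 10.0, 500.0]])   # last gene is a composition outlier
sf = size_factors(counts)             # the outlier barely moves the median
```

In this toy example three genes differ exactly two-fold between the samples, so the medians recover size factors of 1/sqrt(2) and sqrt(2) despite the outlier gene; if most genes *did* change between species, the median would no longer be anchored by stable genes and the factors would be misleading.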

I am surprised that these naive per-million methods are still in use and recommended even by more experienced folks, as there is plenty of literature out there that recommends against them. Some brief examples (there are many more benchmarking papers on this):

As shown in Table 2, in many comparisons, Total Count and RPKM/FPKM perform worse than all other methods, and several authors expressly recommend against its use [9].

(...) TC and RPKM do not improve over the raw counts

If you search Biostars and the web for opinions on RPKM/FPKM/TPM for meaningful between-sample differential analysis, you will see that the statistics community strongly argues against them. The Bioconductor support page is full of threads on this, and the maintainers of the established packages typically recommend against it.

As for your question: I would definitely survey the literature for dedicated approaches that tackle all these points (e.g. SCBN), and not try naive/ad-hoc methods, as these might give you skewed results.


I understand that most statisticians argue against TPM etc., but the thing is: what other metric would one use? There are cases where one cannot rely on a downstream algorithm to account for other variables, so until we get an alternative metric - an actual solution - TPM might have to do as the best available estimate.

I am looking at a middle ground right now: load the raw counts into DESeq2 with the design formula, let it calculate size factors, etc., then export counts(..., normalized = TRUE). If this can be turned into TPM, we would account for library size and batch effects as well as gene length.
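A minimal numpy sketch of that middle ground, assuming the size factors have already been obtained (e.g. from DESeq2) and gene lengths are known; all names and numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical sketch: rescale size-factor-normalized counts (as from
# counts(dds, normalized = TRUE) in DESeq2) by gene length, then to
# per-million, giving a TPM-like unit. All values here are toy data.
def tpm_like(raw_counts, size_factors, lengths_kb):
    norm = raw_counts / size_factors           # inter-library correction
    rate = norm / lengths_kb[:, None]          # gene-length correction
    return 1e6 * rate / rate.sum(axis=0)       # per-million scaling

counts = np.array([[100.0, 200.0],
                   [300.0, 600.0]])
vals = tpm_like(counts, np.array([1.0, 2.0]), np.array([1.0, 2.0]))
# Each column sums to one million.
```

Note that this combination only addresses library size and gene length; batch effects are a separate question.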

written 9 months ago by RamRS

I do not see how this would compensate for batch effects; DESeq2 by default does not correct for them. Of course you can try, and you even should, but be sure to make exploratory MA-plots to check whether the assumption that most genes do not change actually holds.

written 9 months ago by ATpoint

When is the batch effect accounted for in DESeq2? I thought that since the design formula (which includes the batch as a covariate) is specified when the dataset is imported, all operations would take the batch effect covariate into account. Is that assumption mistaken?

written 9 months ago by RamRS

You can include batch in your design formula in DESeq2, but that's just another variable to DESeq2; it doesn't treat it specially to adjust for batch effects. The normalization, I believe, is independent of the design formula you give. It's only in the calculation of DEG p-values that the formula is used to adjust for batch, or any other variable.

written 9 months ago by colin.kern

Thanks for the interesting discussion. Since my only goal is to look at clustering of the samples, I'm wondering if quantile normalization and PCA, or even some kind of hierarchical clustering using Spearman correlation, would avoid many of these pitfalls, since in those cases the relative differences between genes within a sample are what would determine the clustering, rather than the actual differences in expression values between samples.
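That rank-based intuition can be checked directly: Spearman correlation between samples uses only each sample's internal gene ordering, so a monotone per-sample distortion (such as a different TPM denominator) leaves it untouched. A small numpy sketch with toy data, assuming no tied values:

```python
import numpy as np

# Spearman = Pearson on ranks; continuous toy data, so no ties.
def spearman_matrix(x):                           # genes x samples
    ranks = x.argsort(axis=0).argsort(axis=0)     # per-sample gene ranks
    return np.corrcoef(ranks.T)                   # sample-by-sample matrix

rng = np.random.default_rng(0)
expr = rng.gamma(2.0, size=(500, 6))              # toy expression matrix

rho = spearman_matrix(expr)
rho2 = spearman_matrix(expr * [10.0, 1.0, 1.0, 1.0, 1.0, 1.0])
# Rescaling sample 0 changes no ranks, so rho2 equals rho; a distance
# like 1 - rho could then feed hierarchical clustering.
```

This shows why rank-based clustering is insensitive to per-sample scaling problems, though it still depends on the gene set being comparable across species.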

written 9 months ago by colin.kern

I'm sorry about hijacking your post. Clustering should be an easier topic to tackle than unifying datasets in a comparable manner.

written 9 months ago by RamRS

I completely agree and did try to highlight the need for an inter-library normalization - now highlighted more. It seems SCBN requires knowledge of housekeeping genes (i.e. genes that do not change) - is that not a stronger assumption than "most genes do not change" (or that the amounts of up- and down-regulation are balanced)?

written 9 months ago by kristoffer.vittingseerup
Powered by Biostar version 2.3.0