Influence of ribosomal depletion efficiency on PCA clustering
20 months ago
Hugo ▴ 30

Hi everyone,

I’m analyzing an RNA-seq dataset and I’ve come across something that is giving me some headaches.

First, some background: I have eight cyanobacterial samples from two different strains (four vs. four) that show a (rather small) phenotypic difference under the same experimental conditions. My goal is to identify differentially expressed genes that could potentially explain this phenotypic difference, though I wasn’t expecting to find many DEGs since the strains are very similar.

Now the data: after aligning and counting, I found that the total number of counts differs between samples (figure 1A), as does the ribosomal RNA (rRNA) depletion efficiency (figure 1B). All normal up to here; the rRNA depletion kits are not optimized for cyanobacteria, and this is not the first time I have seen this discrepancy in efficiency between samples. A couple more observations: the percentage of rRNA correlates with the total number of counts (figure 1C), and after removing the rRNA counts, all samples have a similar number of counts (figure 1D).

figure_1
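For reference, the per-sample summaries above can be computed in a few lines of R (a minimal sketch; `counts_all` and the logical index `is_rrna` are assumed names, not from the post):

```r
# Assumed objects: counts_all (gene x sample count matrix) and is_rrna
# (logical vector flagging the rRNA genes). Both names are illustrative.
total_counts <- colSums(counts_all)                                  # figure 1A
rrna_pct     <- 100 * colSums(counts_all[is_rrna, ]) / total_counts  # figure 1B
cor(total_counts, rrna_pct)                                          # figure 1C
colSums(counts_all[!is_rrna, ])                   # similar across samples, figure 1D
```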

I removed the rRNA counts, used DESeq2 to apply a variance-stabilizing transformation to the counts, and ran PCA to see how the samples were clustering (figure 2). PC1 explains ~50% of the variance and, at first glance, it looked like it was separating the strains, with the exception of S4. An outlier, I thought.

figure_2
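In DESeq2, the VST + PCA step described above looks roughly like this (a sketch; object and column names such as `counts_no_rrna`, `coldata`, and `strain` are assumptions):

```r
library(DESeq2)

# counts_no_rrna: count matrix with rRNA rows already removed
# coldata: sample table with a "strain" column
dds <- DESeqDataSetFromMatrix(countData = counts_no_rrna,
                              colData   = coldata,
                              design    = ~ strain)
vsd <- vst(dds, blind = TRUE)      # variance-stabilizing transformation
plotPCA(vsd, intgroup = "strain")  # PC1/PC2 score plot as in figure 2
```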

But then I looked again at the percentage of rRNA in each sample and realized the samples’ clustering along PC1 could also be explained by the rRNA percentage. I plotted this, and the correlation is there (figure 3).

figure_3

So I’m wondering a few things.

  1. Does it make any sense that most of the variation in the data is explained by the percentage of rRNA?
  2. Or is this just an artifact?
  3. If it is not, how can this be explained?
  4. Does it make the data unusable for my original question?

Thank you in advance!

Note: in case it matters, the ribosomal depletion was performed with MICROBExpress kit.

bacteria ribosomal rna-seq depletion
20 months ago
LChart 3.9k

As a practical note: If you worry that rRNA proportion impacts the estimation of other genes, you can (and likely should) include it as a covariate in DESeq2 with a design ~ strain + rrna_proportion.
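A minimal sketch of that design (the `coldata` columns `strain` and `rrna_proportion`, and the level names "A"/"B", are assumptions, not from the post):

```r
library(DESeq2)

# rrna_proportion: per-sample fraction of rRNA reads, added to the sample table
coldata$rrna_proportion <- rrna_counts / total_counts
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ rrna_proportion + strain)
dds <- DESeq(dds)
# strain effect, adjusted for rRNA proportion; "A"/"B" are hypothetical levels
res <- results(dds, contrast = c("strain", "B", "A"))
```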

As a technical note: it seems like you removed the rRNA counts prior to normalization -- meaning that the library sizes associated with the counts are possibly not correct. The easiest way to adapt your current workflow would be to run estimateSizeFactors on the original counts, then remove the ribosomal genes, and then use sizeFactors(de.obj) to set the originally estimated size factors on the new object. Your current workflow could result in a consistent over-estimation of the expression of non-ribosomal genes, producing the observed PC1.
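Concretely, the suggested adaptation would look something like this (a sketch; `counts_all`, `is_rrna`, and `coldata` are assumed names):

```r
library(DESeq2)

# Estimate size factors on the FULL count matrix, including rRNA genes
dds_full <- DESeqDataSetFromMatrix(countData = counts_all,
                                   colData = coldata, design = ~ strain)
dds_full <- estimateSizeFactors(dds_full)

# Build the working object without the rRNA rows, then reuse those factors
dds <- DESeqDataSetFromMatrix(countData = counts_all[!is_rrna, ],
                              colData = coldata, design = ~ strain)
sizeFactors(dds) <- sizeFactors(dds_full)
```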


Thanks for your answer.

As for your practical note, that indeed sounds wise; I will definitely implement it.

On the other topic, I gave your suggestion a try but still get a very similar score plot for the PCA. But I thought removing rRNA before normalization was common practice, isn’t that the case? I’ve even seen pipelines where the rRNAs are removed at the counting step. Or is this done under the assumption that all samples have equal rRNA depletion efficiency?

In any case, after trying a few more things, I have just come to terms with the fact that my two experimental conditions have very similar gene expression patterns, and that’s why the rRNA depletion efficiency pops up as the highest source of variation. What I still cannot understand is how this efficiency can introduce such a bias, especially when the samples have a similar number of counts after removing the rRNA…

Anyway, thanks for your help.


I thought removing rRNA before normalization was common practice, isn’t that the case?

It is, and it's the right way to do it. My suggestion was not a good one. Best to remove either before normalization or after normalization (e.g., exclude from the TopTable), but not "use for library size, then remove, then normalize", which is... weird.

Now that you've removed the rRNA -- what happens if you run fGSEA on your PC1 loadings?
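That could be sketched as follows (assuming `vsd` is the variance-stabilized DESeq2 object and `kegg_pathways` a named list of gene sets; both names are illustrative):

```r
library(DESeq2)
library(fgsea)

pca   <- prcomp(t(assay(vsd)))           # PCA on the variance-stabilized matrix
ranks <- sort(pca$rotation[, "PC1"])     # PC1 loadings as the ranked gene statistic
gsea  <- fgsea(pathways = kegg_pathways, # named list of gene sets (assumed available)
               stats    = ranks)
head(gsea[order(gsea$padj), ])           # most significant terms first
```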


If I run fGSEA on the PC1 loadings I get a few significant (adj. P-value < 0.01) terms: 7 when I use KEGG IDs and 2 when I use GO terms (Biological Process).

I've also run fGSEA on the LFCs from DESeq2. I included rrna_proportion in the design as you suggested, then extracted the LFCs for strain and rrna_proportion separately. I get far more DEGs (adj. P-value < 0.01) for rrna_proportion than for strain (1233 vs. 89 genes), and the LFC distributions are also very different (see volcano plot).

volcano_plot
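For what it's worth, extracting the two sets of LFCs separately might look like this (the coefficient names depend on your factor levels; the "B"/"A" levels here are hypothetical, and `resultsNames(dds)` lists the real ones):

```r
resultsNames(dds)                                     # shows available coefficients
res_strain <- results(dds, name = "strain_B_vs_A")    # binary strain contrast
res_rrna   <- results(dds, name = "rrna_proportion")  # continuous covariate coefficient
```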

But when I run fGSEA on the strain LFCs I get a bunch of significant terms (13 from KEGG, 4 from GO BP), most of them overlapping with the ones from the PC1 loadings. And when I run fGSEA on the rrna_proportion LFCs I get nothing significant. I think it's starting to make sense.


Cool. I was more wondering whether the GO terms for PC1 were residual "translation"-type GO terms. You would expect the LFC distributions to be very different, as rrna_proportion is continuous and strain is binary.

