Unexpected separation of RNA-seq samples on PCA plot
Entering edit mode
16 days ago
Tihana ▴ 10

Hello Biostars,

I have been dealing with an interesting issue regarding 9 RNA-seq green algae samples. These samples come from three conditions, each having 3 biological replicates. To remove rRNA from the samples, rRNA was depleted with varying success in order to keep chloroplast and mitochondrial transcripts.

When estimating the proportion of rRNA per sample, 4 seemed to have a high number of rRNA left over. Varying from 28.5% to 35.5%.

Estimated rRNA proportions per sample:

Chlo_auto_A 28.51%
Chlo_auto_B 9.54%
Chlo_auto_C 2.73%
Chlo_mixo_A 9.02%
Chlo_mixo_B 3.56%
Chlo_mixo_C 35.51%
Chlo_hetero_A   28.69%
Chlo_hetero_B   2.43%
Chlo_hetero_C   29.86%

Moreover, when performing DE very few genes come up being differentially expressed due to the high heterogeneity between the biological replicates. Because of this I created a PCA plot using CPM values. The samples cluster as expected, except auto_A, hetero_A, hetero_C and mixo_C. Which, prior to cleaning had a high rRNA proportion.here is the PCA plot

Has anyone experienced anything similar? Or have any idea of how I could use this data to perform a DE analysis?

Any and all advice would be appreciated!

depletion PCA rRNA plot RNA-seq • 407 views
Entering edit mode

It seems that all the samples that don't cluster together are easily explained by higher rRNA proportions. Could you remove all rRNA reads from the samples are rerun the PCA? How many reads are you left with after removing rRNA?

If there are still a reasonable number of reads, and samples cluster together without rRNA, you could still use these samples for DE. The simplest solution is just run your DE workflow, and don't include the rRNA in any downstream analyses.

Entering edit mode

That was my thinking as well. However, this PCA plot is done with the rRNA reads removed. Here are the numbers of reads left over once the rRNA reads were removed:

Chlo_auto_A 34200682
Chlo_auto_B 60878948
Chlo_auto_C 57164502
Chlo_mixo_A 50950794
Chlo_mixo_B 46311471
Chlo_mixo_C 32272467
Chlo_hetero_A   45324032
Chlo_hetero_B   57111969
Chlo_hetero_C   41423468

I did a DE analysis with this set as well and there are very few differentially expressed genes found. And the variance between the samples within the same condition is higher than between conditions here's an example

I guess that if the auto_A and hetero_B samples were exchanged, it would somewhat make sense.

Entering edit mode

Oh, then that is more strange than I had though if rRNA is already removed. But I agree, it does look like maybe two samples were swapped at some point, but given the clustering on the PCA, it's not so clear. You have plenty of reads left, so I can't think of any obvious issues related to that.

Have you looked at a presence/absence PCA?

I can't think of any other suggestions for now, but I'll write more if something comes to mind.

Entering edit mode

Can you check the number of genes with counts of zero per sample and correlate that with the rRNA percentage? It could be that the rRNAs consume a lot of reads, leading to dropouts for many genes in those that have high rRNA. Even after removal of the rRNA counts these samples might simply be undersequenced and that drives up variation. Correlate both the total readcount (with rRNA removed) and number of genes with 0 counts against rRNA.

Entering edit mode


here's a table with the total read count, number of genes with a count of 0 and the proportion of rRNA reads:

 total read count   number of genes with count 0    rRNA proportion

Chlo_auto_A 34200682    63  28.51%
Chlo_auto_B 60878948    52  9.54%
Chlo_auto_C 57164502    82  2.73%
Chlo_mixo_A 50950794    47  9.02%
Chlo_mixo_B 46311471    83  3.56%
Chlo_mixo_C 32272467    32  35.51%
Chlo_hetero_A   45324032    32  28.69%
Chlo_hetero_B   57111969    79  2.43%
Chlo_hetero_C   41423468    29  29.86%

There seems to be very few genes with count 0, in total there are about 12000 annotated genes. And it does not seem to correlate with the high rRNA proportion. And I believe that there are still enough reads even in the samples where many were removed.

Entering edit mode

So far I haven't performed presence/absence PCA on data like this.

Just another interesting observation, prior to cleaning the rRNA reads the correlation matrix between two conditions looked like this: enter image description here

but when the rRNA reads were cleaned the mixo_C sample seemed to show less correlation to other samples of the same condition:

enter image description here


Login before adding your answer.

Traffic: 2298 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6