Question

RNA-seq low quality sample in downstream analysis

2

3 months ago

Yingying ▴ 60

Hi all,

Our lab received bulk RNA-seq data for 6 samples from a collaborator (3 conditions, each in duplicate). We're planning to perform differential expression analysis using DESeq2 or edgeR, but noticed that one sample has ~29% of reads mapped to mitochondrial genes—much higher than the 1–5% range seen in the others. We suspect this could be due to RNA degradation and poor sample quality.

Given our limited sample size, we’d like to retain this sample if possible. What are some recommended approaches to minimize the impact of this poor-quality sample in our analysis? Would you suggest filtering out mitochondrial genes, or using specific normalization methods? Would the conventional DESeq2 pipeline be reliable?

Appreciate any insights or best practices! Thanks in advance.

Mitochondrial genes RNA-seq DESeq2 • 1.2k views

ADD COMMENT • link updated 12 weeks ago by Aleksandra ▴ 190 • written 3 months ago by Yingying ▴ 60

0

I cannot say I have ever bothered with mitochondrial (MT) content in bulk RNA-seq, as opposed to single-cell RNA-seq. As long as you have enough reads to avoid dropouts of many genes, I think you can still normalize the data properly. To judge this, one would need some plots, e.g. PCA, MA-plots, and an overview of library depth.

ADD REPLY • link 3 months ago by ATpoint 89k

0

Hi, thanks for your kind reply. The number of genes detected per sample is roughly the same, and the PCA shows the expected pattern. However, I did notice that the total reads per sample after DESeq2 median-of-ratios normalization look a bit weird.

In detail, the total raw reads for my 6 samples are:

33734449 39615300 **52799011** 27475040 33970007 43363614

After normalization the total reads per sample are:

33499714 34653430 **57137028** 35199458 33363112 34128439

Size factors from sizeFactors(dds) are:

1.0070071 1.1431855 **0.9240770** 0.7805529 1.0181906 1.2706006

What I usually see in other datasets is roughly equal total reads per sample after normalization. To me, the third sample (the one with the very high mitochondrial read fraction) has a weird size factor, which affects its total reads after normalization. Is this expected, or is it worrisome?
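For reference, the three rows of numbers above are consistent with each other: dividing each raw total by its size factor gives the reported normalized total. The short check below uses only the values quoted in this post. It is a sketch of the arithmetic, not a statement about what DESeq2 does internally beyond dividing counts by the size factor. Because the median-of-ratios size factor is a median over genes rather than a ratio of library depths, equal normalized totals are not guaranteed.

```python
# Values copied from the post above (six samples, same order).
raw_totals = [33734449, 39615300, 52799011, 27475040, 33970007, 43363614]
reported_normalized = [33499714, 34653430, 57137028, 35199458, 33363112, 34128439]
size_factors = [1.0070071, 1.1431855, 0.9240770, 0.7805529, 1.0181906, 1.2706006]

# Normalized counts are raw counts divided by the sample's size factor,
# so the normalized column total is raw total / size factor.
recomputed = [raw / sf for raw, sf in zip(raw_totals, size_factors)]

for i, (rec, rep) in enumerate(zip(recomputed, reported_normalized), start=1):
    rel_diff = abs(rec - rep) / rep
    print(f"sample {i}: recomputed={rec:,.0f} reported={rep:,} rel_diff={rel_diff:.1e}")
```

Any small differences come from the size factors being printed with seven decimals.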

ADD REPLY • link 3 months ago by Yingying ▴ 60

0

I do not see anything weird here. RNA degradation can be checked in the lab via RIN; that should have been done as basic QC before library prep so as not to go in blind. I am not sure what the issue is. You say the PCA is fine and the results are essentially the same with or without that sample. So what is the actual issue here? Mitochondrial genes are normal biology, so don't remove them. PCA is based on the most variable genes, so if the PCA does not change after removing the mitochondrial genes, they are not top contributors to the variation.

ADD REPLY • link 3 months ago by ATpoint 89k

0

Thanks for your explanation. I get your points!

ADD REPLY • link 3 months ago by Yingying ▴ 60

score 3 · Answer 1 · 2025-07-22

3

3 months ago

Aleksandra ▴ 190

Hi Yingying, the high MT read percentage in your sample is a classic symptom of apoptosis-driven RNA degradation, which severely skews library size factor estimation in both DESeq2 and edgeR.

1. First, perform a Principal Component Analysis (PCA) on the VST-normalized counts of all genes. This will determine whether the sample is a systemic outlier beyond just the MT contamination.
2. Filter all mitochondrial genes (chrM) from your count matrix before normalization. The MT-transcriptome is irrelevant to your nuclear gene expression study and must be treated as a technical artifact.
3. Re-run the PCA on the filtered, MT-free dataset. The outcome dictates your next step:
   - If the sample now clusters with its biological replicate, the issue was contained. You can proceed with DESeq2 on this cleaned matrix.
   - If the sample remains a distinct outlier, it is compromised by factors beyond MT contamination. It must be discarded.

While the DESeq2 pipeline is robust, it is not immune to extreme library composition bias. Retaining a systemically aberrant sample, even with a small n, will inflate dispersion estimates and critically compromise the statistical power of your entire experiment. It is statistically preferable to proceed with n=1 for one condition than to retain a sample that introduces massive, unexplained variance.
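The MT-filtering step can be sketched in a few lines. This is a minimal illustration in plain Python, not part of any DESeq2 workflow. The count table is made-up toy data, the helper names (`is_mito`, `mito_fraction`, `drop_mito`) are hypothetical, and it assumes mitochondrial genes can be recognised by an `MT-` gene-symbol prefix (the human convention; mouse symbols use `mt-`). With Ensembl IDs you would instead select genes by chromosome from the annotation.

```python
# Toy counts (hypothetical): gene symbol -> counts for three samples.
counts = {
    "ACTB":   [900, 950, 600],
    "GAPDH":  [800, 820, 560],
    "TP53":   [120, 110, 80],
    "MT-CO1": [40, 50, 400],
    "MT-ND1": [30, 25, 300],
}

def is_mito(gene, prefix="MT-"):
    """Assumption: mitochondrial genes carry an 'MT-'/'mt-' symbol prefix."""
    return gene.upper().startswith(prefix)

def mito_fraction(counts):
    """Fraction of each sample's total counts assigned to mitochondrial genes."""
    n_samples = len(next(iter(counts.values())))
    total = [0] * n_samples
    mito = [0] * n_samples
    for gene, values in counts.items():
        for j, v in enumerate(values):
            total[j] += v
            if is_mito(gene):
                mito[j] += v
    return [m / t for m, t in zip(mito, total)]

def drop_mito(counts):
    """Return a copy of the count table without mitochondrial genes."""
    return {g: v for g, v in counts.items() if not is_mito(g)}

print([f"{f:.1%}" for f in mito_fraction(counts)])
print(sorted(drop_mito(counts)))
```

Reporting the per-sample MT fraction before filtering keeps a record of why a sample was flagged in the first place.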

ADD COMMENT • link 3 months ago by Aleksandra ▴ 190

2

Hi Aleksandra, thanks for your kind explanation. PCA plot of our original normalized data looks as expected, and really does not change after I filter out all the mitochondrial genes. The statistically significant DEG results are also very similar, with only 8 DEGs identified in original data but absent in filtered data. What are your thoughts on this? Is keeping mito genes acceptable? Or what other QCs should we consider? Thanks a lot!

ADD REPLY • link 3 months ago by Yingying ▴ 60

0

Thank you for your response. This is an excellent result!

Regarding the PCA plot: as expected, there has been little change to the PCA plot. PCA visualises the largest sources of variance in your data. In a well-designed experiment, the strongest signal should represent biological differences between conditions rather than technical artefacts. The fact that the sample clusters remain in the same place indicates that your biological signal is strong and reliable. Although significant, MT contamination was a secondary source of variance and thus did not determine the primary components. This is a good sign.

Regarding the eight 'missing' DEGs: this is the most important result. It shows that, by removing the MT genes and correcting the normalisation error, eight false positives have been successfully eliminated. These genes were probably only 'significant' in the original analysis because the high MT content in one sample distorted that library's size factor estimate, creating an artefactual difference. By cleaning the data, you have made your results more accurate and reliable.

To confirm this, please check the size factors generated by DESeq2 for both analyses (before and after filtering). You will see that the outlier sample had a clearly different size factor in the original analysis.
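The before/after size-factor comparison can be illustrated with a small median-of-ratios calculation. The sketch below is a plain-Python approximation of the idea behind DESeq2's size factors, run on made-up toy counts. The function name `size_factors` and the `MT-` prefix filter are assumptions for illustration. On real data you would simply compare `sizeFactors(dds)` from the two DESeq2 runs. Since the size factor is a median over genes, removing a handful of genes changes it only as far as the median ratio shifts, so the comparison is worth running rather than assuming the outcome.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a dict of gene -> per-sample counts."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) <= 0:
            continue  # genes with a zero in any sample are skipped
        logs = [math.log(v) for v in values]
        log_geo_mean = sum(logs) / n_samples  # log of the per-gene geometric mean
        for j, lv in enumerate(logs):
            log_ratios[j].append(lv - log_geo_mean)
    # Size factor = exp(median over genes of log(count / geometric mean)).
    return [math.exp(median(r)) for r in log_ratios]

# Toy counts (hypothetical); the third sample has inflated MT counts.
counts = {
    "ACTB":   [900, 950, 600],
    "GAPDH":  [800, 820, 560],
    "TP53":   [120, 110, 80],
    "EEF1A1": [700, 690, 470],
    "RPLP0":  [500, 520, 340],
    "MT-CO1": [40, 50, 400],
    "MT-ND1": [30, 25, 300],
}
nuclear_only = {g: v for g, v in counts.items() if not g.upper().startswith("MT-")}

before = size_factors(counts)
after = size_factors(nuclear_only)
for j, (b, a) in enumerate(zip(before, after), start=1):
    print(f"sample {j}: with MT genes={b:.3f} without MT genes={a:.3f}")
```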