Question

Mitochondrial genes - TPM calculation bulk RNA-Seq

5 months ago

nick_b55 ▴ 10

Hello all,

I was wondering if any of you have encountered a situation for bulk RNA-Seq where, possibly due to low sample quality or presence of dead cells, mitochondrial genes are expressed to a very large degree relative to other genes, thus skewing TPM values of all nuclear genes (by effectively scaling them down).

In such a situation, what could I potentially do to alleviate this beyond preparing fresh samples? Could I, for example, exclude 'MT-' genes from samples and then recalculate TPM based on this filtered set of genes?

Many thanks in advance for any insight

TPM RNA-Seq mtDNA • 1.1k views

ADD COMMENT • link 5 months ago by nick_b55 ▴ 10

That is called a composition bias and is exactly why TPMs and other per-total scaling techniques are a poor choice. More intelligent approaches such as the size/norm factor methods from edgeR and DESeq2 are better at compensating for that. What is the analysis you plan to do?
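
To illustrate the composition-bias point, here is a minimal sketch of the median-of-ratios idea behind size-factor normalisation. It is a toy re-implementation in plain Python, not the actual DESeq2 or edgeR code, and the sample names, gene order, and counts are invented for illustration.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts: dict mapping sample name -> list of raw counts (same gene order).

    Returns one size factor per sample, computed as the median ratio of the
    sample's counts to the per-gene geometric mean across samples.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Only genes with non-zero counts in every sample have a defined log ratio.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g]
                           for g in usable))
        for s in samples
    }

# Hypothetical data: four nuclear genes followed by one mitochondrial gene.
# Sample B has identical nuclear counts but 100x more mitochondrial reads.
counts = {
    "A": [100, 200, 300, 400, 500],
    "B": [100, 200, 300, 400, 50000],
}

size_factors = median_of_ratios_size_factors(counts)   # both close to 1.0
library_sizes = {s: sum(c) for s, c in counts.items()}  # A: 1500, B: 51000
```

In this toy case the size factors stay near 1 for both samples because the median ignores the single outlying gene, whereas dividing by library size would shrink every nuclear gene in sample B by a factor of 34.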

ADD REPLY • link 5 months ago by ATpoint 82k

Thank you for your reply.

I want to give an idea of the expression levels of certain genes within my samples (these are replicates of the same cell type). This is not for comparisons across sample types, as I know TPM, FPKM, etc. are inappropriate for that. However, I thought that TPM was a usable measure for a within-sample comparison.

I had thought normalisations used by edgeR and DESeq2 were robust for comparing the same gene(s) across different sample types, but not for my within-samples analysis as they don't factor in gene length, for example.

Thanks again

ADD REPLY • link 5 months ago by nick_b55 ▴ 10

For within-sample you do not need any normalization. It's just ratios, maybe adjusted for gene length indeed, but everything else is just a linear scaling factor that does not change the ratio, or am I missing something here?
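
A small sketch of that point, assuming invented gene names, counts, and lengths: the ratio between two genes within one sample is the same whether it is taken from length-adjusted counts or from TPM, because the TPM denominator cancels out.

```python
import math

def length_normalised_rates(counts, lengths_kb):
    """Reads per kilobase for each gene (counts divided by gene length in kb)."""
    return {g: counts[g] / lengths_kb[g] for g in counts}

def to_tpm(rates):
    """Scale per-kilobase rates so that they sum to one million."""
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}

# Hypothetical single sample dominated by one mitochondrial gene.
counts = {"geneA": 300, "geneB": 100, "MT-CO1": 50000}
lengths_kb = {"geneA": 1.5, "geneB": 1.0, "MT-CO1": 1.5}

rates = length_normalised_rates(counts, lengths_kb)
tpms = to_tpm(rates)

ratio_from_rates = rates["geneA"] / rates["geneB"]  # 200 / 100 = 2.0
ratio_from_tpm = tpms["geneA"] / tpms["geneB"]      # same ratio, up to rounding

assert math.isclose(ratio_from_rates, ratio_from_tpm)
```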

ADD REPLY • link 5 months ago by ATpoint 82k

The goal was ultimately to classify gene expression levels. I know that the EMBL Expression Atlas, for example, uses TPM cutoffs to define low, medium and high gene expression, and I had hoped to use their criteria. However, due to the aforementioned mitochondrial transcript issue, many genes are having their TPMs 'squashed' to extremely low values, even below 0.5 TPM. This also applies to genes we know to have reasonable expression based on qPCR etc. done previously.

However, when I recalculated TPMs after removing the MT- genes, the values were more in line with other RNA-Seq data we have access to for these cell types.

So I suppose I wanted to get some idea of the validity of calculating TPM with mitochondrial genes removed.
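
For reference, a minimal sketch of the recalculation being described, assuming raw counts and gene lengths are available as dictionaries and that mitochondrial genes can be recognised by an `MT-` prefix. The gene names other than the prefix convention, the counts, and the lengths are all made up.

```python
def tpm(counts, lengths_bp):
    """TPM for the genes in `counts`; `lengths_bp` maps gene -> length in bp."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}

# Hypothetical sample in which mitochondrial reads dominate.
counts = {"MT-CO1": 90000, "MT-ND1": 60000, "GENE_A": 500, "GENE_B": 2000}
lengths_bp = {"MT-CO1": 1500, "MT-ND1": 1000, "GENE_A": 1000, "GENE_B": 2000}

# TPM over all genes, mitochondrial included.
tpm_all = tpm(counts, lengths_bp)

# Drop MT- genes from the raw counts first, then recompute TPM.
nuclear_counts = {g: c for g, c in counts.items() if not g.startswith("MT-")}
tpm_nuclear = tpm(nuclear_counts, lengths_bp)

# Every nuclear gene is multiplied by the same factor.
scale = {g: tpm_nuclear[g] / tpm_all[g] for g in nuclear_counts}
```

With these numbers both nuclear genes are scaled up by the same factor (about 81), so their order and ratio are unchanged while their absolute TPM values move across any fixed cutoff.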

Thanks again

ADD REPLY • link 5 months ago by nick_b55 ▴ 10

There aren't really any good cutoffs for TPMs. The sum of the TPM values across all transcripts will always equal 10^6.

You can say: The transcript is more highly expressed than this other transcript, but that's about all you can do semi-reliably.

By recalculating TPMs after removing MT genes, all you're doing is simply rescaling. I can remove every gene except two genes: One gene might have a TPM of 200000 and the other gene might have a TPM of 800000. That doesn't mean that either gene is highly expressed -- it just means the second gene is more highly expressed than the first gene.

So it's not wrong to remove the MT genes -- my point is to not over-interpret TPMs because RNA-seq is a technology that gives you "relative abundance". If your MT genes have higher relative abundance than most other genes, that just means your sample has high mitochondrial content.
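
The rescaling can be written out explicitly. The notation here is an assumption introduced for illustration: c_i is the read count of gene i, \ell_i its length, and M the set of removed (mitochondrial) genes.

```latex
\begin{align}
r_i &= \frac{c_i}{\ell_i}, &
\mathrm{TPM}_i &= 10^6 \, \frac{r_i}{\sum_j r_j}, \\
\mathrm{TPM}'_i &= 10^6 \, \frac{r_i}{\sum_{j \notin M} r_j}
  = \mathrm{TPM}_i \cdot \frac{10^6}{10^6 - \sum_{m \in M} \mathrm{TPM}_m}
  & &\text{for } i \notin M .
\end{align}
```

The factor on the right is the same for every remaining gene, so ratios between genes are untouched. If, say, the MT genes account for 900,000 of the million, every other gene's TPM is simply multiplied by 10.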

ADD REPLY • link 5 months ago by dsull ★ 5.9k

That makes sense, thank you.

ADD REPLY • link 5 months ago by nick_b55 ▴ 10

Problem is that counts are only in part a consequence of expression level. Other factors are mappability and (sample-specific) biases such as GC/PCR bias. So I am not sure I would believe that a gene with low counts is "lowly-expressed" and make biological inference on that. Seconding @dsull and what I already said, I would settle more for ratios (geneA is higher/lower than geneB) rather than going for strict cutoffs here. If you plot ranked TPMs you will see that the inflection point of the curve (by eye) is not easy to pinpoint, it has large standard error no matter where you put it, and the same is true for any other arbitrary cutoff.

ADD REPLY • link 5 months ago by ATpoint 82k

I see, that makes sense, thank you. Attributing counts to expression levels is not as simple as I had thought.

ADD REPLY • link 5 months ago by nick_b55 ▴ 10