RPKM and TPM when operating with small group of genes: is there a bias?
1
0
Entering edit mode
21 months ago
Macspider ★ 3.5k

Hi,

I am currently analysing a set of ~700 genes for which I have RNAseq read counts. I have calculated the differentially expressed genes and their relative rpkm and tpm, but I am left with a question that is more theoretical.

When I calculate the RPKM for each gene in each sample, the values I get are somewhat in the range of what I'm used to see when I do RNASeq data analysis.

> max(rpkm[ , "F3_A_R1"])
[1] 681.6962
> min(rpkm[ , "F3_A_R1"])
[1] 0


When I convert to TPM, the values go overboard.

> max(tpm[ , "F3_A_R1"])
[1] 158783.9
> min(tpm[ , "F3_A_R1"])
[1] 0


This is how I'm calculating RPKM:

# sum counts for each sample, then divide each sum by a million
scalingFactors <- colSums(raw_cts) / 1e6
# divide raw counts by their scaling factor
rpm <- raw_cts / scalingFactors
# divide each rpm by the gene length to get rpkm
rpkm <- rpm / orf_lengths


And this is how I'm converting to TPM:

tpm <- rpkm
for (i in 1:length(colnames(tpm))) {
tpm[ , i] <- ( tpm[ , i] / sum(tpm[ , i]) ) * 1e6
}


Basically, I use the known formulas, as explained for example here.

Why do you think that there is such a difference between TPM and RPKM? My guess is that the limited number of genes makes both RPKM and TPM very unpredictable / unreliable as a measure.

Do you have any suggestion, on how to balance this out?

EDIT #1:

I have tried to add a "fake" gene where to put all the "unassigned" reads. This has reduced the RPKM / TPM values drastically, the max that I observe now is RPKM = 43.4. Of course this approach has the opposite issues, but I tend to consider it more reliable since I am using these RPKMs and TPMs to estimate relative abundance of transcripts within samples, later on.

EDIT #2:

I decided to not use RPKM and TPM. Although these are normalizeed measurements, with such few genes it is unreliable to use them. Instead of solving biases, they introduce some.

RPKM TPM RNASeq Statistics Sequencing • 1.2k views
1
Entering edit mode

I don't think that RPKM or TPM are meant to be calculated for a handful of genes, you need to compute them genome-wide and keep the same values after subsetting genes.

0
Entering edit mode

In my workflow, the reads are mapped to certain known genes of a metagenome. While I cannot include other genes because these 700 are all I have, I still have the total number of reads. Would it help, to create a fake gene called "not_assigned" to which I assign all the reads that aren't assigned to any of the 700 genes in my gene set? And then keep it in for the RPKM and TPM calculation

0
Entering edit mode

It is not at all clear to me that those normalizations are sound when used on only a subset of genes.

0
Entering edit mode

Yep, that is exactly my doubt.

1
Entering edit mode
21 months ago
Macspider ★ 3.5k

I am answering to my own question after some brainstorming. I reported the same as "EDIT #2" in the question. I decided to not use RPKM and TPM. Although these are normalizeed measurements, with such few genes it is unreliable to use them. Instead of solving biases, they introduce some.

0
Entering edit mode

EDIT#1 seems more reasonable than EDIT#2 depending on your experimental plan, i.e. if you want to perform comparisons in one genome between multiple conditions, a normalized expression level would be much more reliable than raw sequencing depth for a given gene. Imagine that your sequencing depth is not the same between two samples? The whole analysis is wrong. Even if you want to compare genes expression within a given sample, normalization across genomes can be an interesting choice, especially RPKM that are designed to rank genes expression within a given sample (while TPM are, by design, better to compare a given gene expression between samples).

If you are analyzing metagenomes expression, I would add the species/genus name in a given column of your data frame and get all the expressed genes in one table where all expression levels are normalized. You estimate individual genes expression from a pool of reads, just like in classical RNA-seq, in metatranscriptomics, it just happens that these reads are coming from different organisms. Thus, I think it's still better to normalize genes expression.

Never analyzed metatranscriptomics, so please, read my comment with a grain of salt.

1
Entering edit mode

Yours are all valid points which I had taken into consideration in early days of this analysis.

Here's the situation: We don't have a reference transcriptome to map the reads against, but only this shortlist of characterized genes for our meta-transcriptome. However, when sampling RNA, we sequence everything that is transcribed (more or less). Hence, the reads originate from all genes present in the microbial community, but the mapping targets are only our subset of genes. Most reads go unmapped due to this.

Of some million reads, only some tens of thousands map to the genes we want, so normalization may lead to unexpected results when working with such a low abundance of reads. What we're doing is calculating relative abundances within each sample.

0
Entering edit mode

TPM is "relative abundance". A TPM of 158783.9 is not incorrect and it's not overboard -- it means that that transcript's abundance is high (relative to the other mapped transcripts).

Consider this hypothetical situation: If all of your reads mapped to two transcripts, one of those transcripts is guaranteed to have a TPM >= 500000.

0
Entering edit mode

You are actually right. Thanks for this clarification. However, if I am not sure of the RPKM estimations, how can I be sure of the TPM calculated from RPKMs?

0
Entering edit mode

I don't see how you can be unsure of the RPKM; you take the number of reads mapped to a transcript, divide by its length (in kilobases), and divide by the total number of mapped reads.

There's really no other way to be sure vs. unsure unless you misapply the mathematical formula.

Sum of TPMs will equal one million.

If you add some "fake gene", you'll have more total number of mapped reads so your RPKMs will be lower. But that "fake gene" will also 'consume' your TPM abundance so your TPMs will be lowered accordingly. Relative abundances don't change (e.g. ratio of transcript A to transcript B to transcript C etc. will remain the same).

0
Entering edit mode

My problem is that I have no way to ensure if the reads that don't map to my subset of genes don't map at all, or simply we haven't sequenced and characterized their gene of origin yet.

0
Entering edit mode

Well, you can only work with what you have. You can basically only characterize the abundance of the genes you have relative to one another; if there happens to be an unannotated/unmapped gene not expressed at all in sample #1 but constitutes the majority of the reads in sample #2, your between-samples comparisons will of course be messed up. The DESeq method for between-samples comparisons works better than TPM, but it's still best if you actually have that gene annotated.

A suggestion I have would be to try to complete the reference (e.g. use an assembler). However, I've never done metatranscriptomics analysis so I'm not qualified to provide solutions to the problem; I can only tell you what can happen with the math.

0
Entering edit mode

Improving the assembly and generating a reference metatranscriptome is what we're set out to do in the next 12 months :)