Convert read counts to RPKM or TPM values: Numbers don't look right?
3
1
Entering edit mode
6.1 years ago
Tom ▴ 40

Hi all, I have a question about converting read count data from RNASeq data to expression values.

For example, this is a sample of read count data from GSM1375434_p01_counts.txt from the study GSE57110 in the GEO database:

name                read_count
ENSRNOG00000000001  4
ENSRNOG00000000007  1278
ENSRNOG00000000008  3
ENSRNOG00000000009  0
ENSRNOG00000000010  28
ENSRNOG00000000012  0
ENSRNOG00000000014  47
ENSRNOG00000000017  0
ENSRNOG00000000021  48
ENSRNOG00000000024  77
ENSRNOG00000000028  4
ENSRNOG00000000029  80


I have a number of samples/files like this, and I want to turn these read counts into expression values.

I found a helpful guide to doing that here, however, when I follow the instructions, I think my numbers are unreasonable.

I followed the above instructions to obtain a TPM score for each gene (I also have a similar table for RPKM, which doesn't make sense either). If someone could tell me where I've gone wrong with this table I would appreciate it. In the below table:

• name = Ensembl Gene ID, obtained from the GSM1375434_p01_counts.txt file (names shortened for this example)
• start = gene start position, obtained from BioMart (NOTE, for this example, I have shortened these numbers for presentation purposes)
• end = gene end position, obtained from BioMart (NOTE, for this example, I have shortened these numbers for presentation purposes)
• length = end - start
• length_in_kbp = length /1000
• RPK = length_in_kbp / read_count
• RPK/scaling_factor = RPK / (sum of all the RPK values / 1,000,000)
>     name    read_count  start   end length_(bp) length_in_kbp   RPK RPK/scaling_factor
>     RNOG001     4       23066   23069   1420       1.42         0.355    27053
>     RNOG007    1278      5686   5690    40761      40.761       0.031   2430
>     RNOG008    3         8254   8258    36572      36.57        12.19   929030
>     RNOG009    0         1047   1049    16385      16.38        0       0
>     RNOG010    28        2060   2061    3809       3.80         0.136   10367
>     RNOG012    0         1476   1478    6254       6.25         0       0
>     RNOG017    0         2543   2544    10588      10.58        0       0
>     RNOG021    48        1519   1525    1201       1.20         0.025   1906
>     RNOG024    77        1689   1690    29514      29.51        0.383   29210

>     Total_RPK   13.12
>     Total_RPK_Divided_By_A_Million  1.31E-05


So as you can see, my values for RPK/scaling_factor are in some cases high (like >300,000). I got similar numbers for a table that I did for RPKM instead of TPM. I'm wondering if someone could point out (possibly with an example for one line) how I mis-calculated the TPM for each gene? One thing I was thinking is maybe I took the wrong gene length, I simply used BioMart to get the start and end position for each gene, maybe I was meant to sum all the transcript lengths for a gene? However the information I'm given in GEO is only for genes, not transcripts?

Thanks

gene expression tpm rpkm geo • 12k views
0
Entering edit mode

This thread has code examples if you want to double-check your calculations: Calculating TPM from featureCounts output

1
Entering edit mode
6.1 years ago
apa@stowers ▴ 580

0
Entering edit mode

Many thanks, but even if I swap those two numbers around; my numbers are still unrealistic?

The last two columns of the table would be changed to (e.g for the the first row, read count 4 / gene kb length 1.42 = 2.81):

RPK    RPK/scaling_factor
2.816901408 33463.20516
31.35349967 372461.9501
0.082029968 974.4699084
0           0
7.351010764 87325.87536
0       0
0           0
39.96669442 474781.8617
2.608931355 30992.63783

Total RPKM: 84.17906759


Total RPKM/1,000,000: 8.41791E-05

1
Entering edit mode
6.1 years ago
Tom ▴ 40

Actually I'm come up with a sort of solution. I've noticed if I remove the genes with a gene length < 300 base pairs (extremely short genes) and if I remove the genes whose number of reads is less than the average number of reads across all of the genes (so removing genes with very low read counts), then when I do the above table, the numbers are "more reasonable", the average expression is 40.63, range is 0.03-65564, and there is still some what I think are unusual outliers (e.g. the highest gene expression value is 65564), but then when I log2 transform these, the range is from -4 to 16, which looks reasonable.

This is obviously ad-hoc and maybe there are more scientifically-robust ways of doing this "quality filtering", but I thought I would just share this as a possible solution, that it's not a problem with the way the calculation is being done, but rather with the values themselves.

1
Entering edit mode

Perhaps you should consider setting aside manual TPM/RPKM calculations and use DESeq2 or edgeR to do your analysis.

0
Entering edit mode

Thank you. I did look into edgeR, but the problem is that I want to turn my read counts into actual "gene expression values" for my experiment, not just do a differential expression analysis straight from the read counts.

The EdgeR manual says "EdgeR is concerned with relative changes in expression levels between conditions, but not directly with estimating absolute expression levels." (I'm assuming DESeq2 is similar). So that's why i'm doing it manually.

1
Entering edit mode

You can get those counts from DESeq2 (and probably edgeR) as well. See this: https://support.bioconductor.org/p/66067/
retrieve normalised count data from DESeq2

For edgeR: Need to export EdgeR normalized counts

0
Entering edit mode

Thanks for everyone's help, there's a script here here using EdgeR, just leaving it in case it helps anyone in future.

1
Entering edit mode
6.1 years ago
mforde84 ★ 1.4k

I wrote: https://github.com/mforde84/tpmcalc to convert counts to tpm and log transformed tpm, where log(tpm = 0) is equal to a randomly generated normal distribution where mean is min(log(tpm>0)) - 1, sd +/- 1. you'll need the feature size to do the calculation. also, i typically use the gene model feature size for calculations, it may be more accurate to use average transcript feature sizes instead, but it doesn't seem necessary in my application. it might even make sense to use counts per million (CPM) instead, as it avoids the who fragment size correction thing.

$git clone https://github.com/mforde84/tpmcalc$ cd tpmcalc
$g++ -o tpmcalc tpmcalc.cpp --std=c++1z$ ./tpmcalc --help
./tpmcalc begin-size-column annotation-column start-of-counts-column counts-file tpm-output logtpm-output
\$ ./tpmcalc 2 1 4 /path/to/sample_data.txt /path/to/sample_data_tpm.txt /path/to/sample_data_log2tpm.txt