Calculating TPM Values
1
3
Entering edit mode
6.1 years ago
biocool2018 ▴ 30

Hi All,

For one of my analysis I need to calculate TPM values. Unfortunately, I have only a count matrix with row HUGO annotated genes and columns are the samples. Unfortunately, our collaborator have deleted all intermediate steps so I have only the counts and a gtf table that is used to calculate the counts. For calculation of TPM I need gene lengths which I can get from the gtf file. However, I am not sure how accurate this would be not knowing which exons are expressed. Any one has any experience on this? Do I need to do the analysis from scratch (redo all star + feature counts for all my 200 samples)?

TPM RNA-Seq sequencing • 13k views
ADD COMMENT
0
Entering edit mode

For which purpose do you need TPM? That's not the "best" type of normalization, I would suggest using DESeq2 or edgeR normalised counts.

ADD REPLY
0
Entering edit mode

Thanks for the reply. I need to run the CIBERSORT software so I need absolute expression of genes and the authors suggest TPM for that.

ADD REPLY
4
Entering edit mode
6.1 years ago
Tao ▴ 530

Hi biocool2018,

You don't need to rerun everything. The read count table is enough to calculate TPM table. The thing is, you might need to use the length of union of exons instead of using the gene length.

Please see here for the source code transforming reads count to TPM. The following code is simplified version by removing meanFragmentLength since you might do not have the information.

Counts_to_tpm <- function(counts, featureLength) {

  # Ensure valid arguments.
  stopifnot(length(featureLength) == nrow(counts))

  # Compute effective lengths of features in each library.
  effLen <- featureLength

  # Process one column at a time.
  tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    rate = log(counts[,i]) - log(effLen)
    denom = log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
  }))

  # Copy the row and column names from the original matrix.
  colnames(tpm) <- colnames(counts)
  rownames(tpm) <- rownames(counts)
  return(tpm)
}

Please see here for the source code calculating length of union of exons.

Hope those would help!

Best,

Tao

ADD COMMENT
0
Entering edit mode

This is really helpful but just to understand the concept. Suppose in a hypothetical data a gene has 10 exons of which only one isoform that uses three of these exons. In this particular example is it true that the gene length should be the union of these three exons. When feature counts calculate gene length does it calculate the union of all expressed exons or all the exons in the gtf file.

ADD REPLY
1
Entering edit mode

In your example, featureCount would give the length of union of these three exons. You can read the 7.2.8 Program output part in featureCount manual, which I also paste here:

Column ‘Length’ always contains one single value which is the total number of non-overlapping bases included in a meta-feature (or a feature), regardless of counting at meta-feature level or feature level. When counting RNA-seq reads to genes, the ‘Length’ column typically contains the total number of non-overlapping bases in exons belonging to the same gene for each gene.

For definitions of feature and meta-feature, please read following copied from featureCount website:

Each entry in the provided annotation file is taken as a feature (e.g. an exon). A meta-feature is the aggregation of a set of features (e.g. a gene). The featureCounts program uses the gene_id attribute available in the GTF format annotation (or the GeneID column in the SAF format annotation) to group features into meta-features, ie. features belonging to the same meta-feature have the same gene identifier.

featureCounts can count reads at either feature level or at meta-feature level. When summarizing reads at meta-feature level, read counts obtained for features included in the same meta-feature will be added up to yield the read count for the corresponding meta-feature.

So, my experience is that you can use the gene length (length of union of exons) calculated from GTF file if you are doing research on gene-level not isoform level. In that case, the gene length for all samples would be same. Actually, gene length normalization is not so important since we usually compare the expression level of same gene between different groups. Please see the discussion in another thread here.

And I prefer to use the same gene length (union of exons from GTF file) when calculating TPM/FPKM. I will give an example here under the hypothesis that there is no library depth bias: suppose the expression of gene A in case group is isoform A1 only, with 200bp length, and we get 200 reads; and in control group, only isoform A2 with 400bp length is expressed, and we get 400 reads. If we normalize by isoform length, we will detect no expression difference of gene A between case and control, but if we normalize by same gene length, we will detect the difference. That's my personal opinion which might need more discussion.

ADD REPLY
0
Entering edit mode

Tao, Thanks for the detailed response. This is great and cleared my mind about this issue.

ADD REPLY
0
Entering edit mode

@biocool2018 Hi, I'm using CIBERSORTx for check cell types on RNA-Seq data, and I'm not sure about which count normalization it's better for input: CPM or TPM. I'm using LM22 as the signature matrix. The tutorial recommends that we should normalize the mixture file same as signature matrix, however, LM22 is microarray and I don't know how it was normalized. I also have access only to count matrix and using CPM will be easier, but I'm not sure if this is the best way. What signature matrix did you use? Why did you choose TPM?

ADD REPLY

Login before adding your answer.

Traffic: 2972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6