Question: Calculating TPM Values
2
gravatar for biocool2018
14 months ago by
biocool201820
biocool201820 wrote:

Hi All,

For one of my analysis I need to calculate TPM values. Unfortunately, I have only a count matrix with row HUGO annotated genes and columns are the samples. Unfortunately, our collaborator have deleted all intermediate steps so I have only the counts and a gtf table that is used to calculate the counts. For calculation of TPM I need gene lengths which I can get from the gtf file. However, I am not sure how accurate this would be not knowing which exons are expressed. Any one has any experience on this? Do I need to do the analysis from scratch (redo all star + feature counts for all my 200 samples)?

sequencing rna-seq tpm • 2.6k views
ADD COMMENTlink modified 14 months ago by Tao300 • written 14 months ago by biocool201820

For which purpose do you need TPM? That's not the "best" type of normalization, I would suggest using DESeq2 or edgeR normalised counts.

ADD REPLYlink written 14 months ago by WouterDeCoster39k

Thanks for the reply. I need to run the CIBERSORT software so I need absolute expression of genes and the authors suggest TPM for that.

ADD REPLYlink written 14 months ago by biocool201820
1
gravatar for Tao
14 months ago by
Tao300
Tao300 wrote:

Hi biocool2018,

You don't need to rerun everything. The read count table is enough to calculate TPM table. The thing is, you might need to use the length of union of exons instead of using the gene length.

Please see here for the source code transforming reads count to TPM. The following code is simplified version by removing meanFragmentLength since you might do not have the information.

Counts_to_tpm <- function(counts, featureLength) {

  # Ensure valid arguments.
  stopifnot(length(featureLength) == nrow(counts))

  # Compute effective lengths of features in each library.
  effLen <- featureLength

  # Process one column at a time.
  tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    rate = log(counts[,i]) - log(effLen)
    denom = log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
  }))

  # Copy the row and column names from the original matrix.
  colnames(tpm) <- colnames(counts)
  rownames(tpm) <- rownames(counts)
  return(tpm)
}

Please see here for the source code calculating length of union of exons.

Hope those would help!

Best,

Tao

ADD COMMENTlink written 14 months ago by Tao300

This is really helpful but just to understand the concept. Suppose in a hypothetical data a gene has 10 exons of which only one isoform that uses three of these exons. In this particular example is it true that the gene length should be the union of these three exons. When feature counts calculate gene length does it calculate the union of all expressed exons or all the exons in the gtf file.

ADD REPLYlink written 14 months ago by biocool201820
1

In your example, featureCount would give the length of union of these three exons. You can read the 7.2.8 Program output part in featureCount manual, which I also paste here:

Column ‘Length’ always contains one single value which is the total number of non-overlapping bases included in a meta-feature (or a feature), regardless of counting at meta-feature level or feature level. When counting RNA-seq reads to genes, the ‘Length’ column typically contains the total number of non-overlapping bases in exons belonging to the same gene for each gene.

For definitions of feature and meta-feature, please read following copied from featureCount website:

Each entry in the provided annotation file is taken as a feature (e.g. an exon). A meta-feature is the aggregation of a set of features (e.g. a gene). The featureCounts program uses the gene_id attribute available in the GTF format annotation (or the GeneID column in the SAF format annotation) to group features into meta-features, ie. features belonging to the same meta-feature have the same gene identifier.

featureCounts can count reads at either feature level or at meta-feature level. When summarizing reads at meta-feature level, read counts obtained for features included in the same meta-feature will be added up to yield the read count for the corresponding meta-feature.

So, my experience is that you can use the gene length (length of union of exons) calculated from GTF file if you are doing research on gene-level not isoform level. In that case, the gene length for all samples would be same. Actually, gene length normalization is not so important since we usually compare the expression level of same gene between different groups. Please see the discussion in another thread here.

And I prefer to use the same gene length (union of exons from GTF file) when calculating TPM/FPKM. I will give an example here under the hypothesis that there is no library depth bias: suppose the expression of gene A in case group is isoform A1 only, with 200bp length, and we get 200 reads; and in control group, only isoform A2 with 400bp length is expressed, and we get 400 reads. If we normalize by isoform length, we will detect no expression difference of gene A between case and control, but if we normalize by same gene length, we will detect the difference. That's my personal opinion which might need more discussion.

ADD REPLYlink modified 14 months ago • written 14 months ago by Tao300

Tao, Thanks for the detailed response. This is great and cleared my mind about this issue.

ADD REPLYlink written 14 months ago by biocool201820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1503 users visited in the last hour