How to compute TPM normalized values for TCGA miRNA data?
5 weeks ago
Ngrin • 0

Hello, I am trying to follow the preprocessing steps described in this publication (Individualized multi-omic pathway deviation scores using multiple factor analysis). As explained in their supplementary material, the authors followed the steps below:

Normalized miRNA abundance was quantified as Reads per million microRNA mapped (RPMMM) values. RNAseq and miRNA-seq quantifications were TMM-normalized (Robinson and Oshlack, 2010), converted to counts per million (CPM), and log2-transformed.

As I understand it, the following steps were performed:

  1. Quantification of miRNA abundance as RPMMM values -> this is already done when downloading miRNA data from TCGA
  2. TMM normalization for RNA-seq and miRNA-seq data
  3. Conversion to counts per million (CPM)
  4. Log2 transformation

I have written the R code below. However, since this is my first time working with miRNA data, I am not sure whether everything is implemented correctly.

# Calculate TPM for RNA-seq data having a vector of gene lengths
x <- RNA_counts/geneLength
norm_RNA_counts <- t(t(x) * 1e6 / colSums(x))

# Calculate TPM for miRNA-seq data
library_sizes_miRNA <- colSums(miRNA_counts)
scaling_factors_miRNA <- median(library_sizes_miRNA) / library_sizes_miRNA
norm_miRNA_counts <- t(t(miRNA_counts) * scaling_factors_miRNA)

# Calculate CPM for RNA-seq (scale each sample by its own column total)
cpm_RNA <- t(t(norm_RNA_counts) / colSums(norm_RNA_counts)) * 1e6

# Calculate CPM for miRNA (scale each sample by its own column total)
cpm_miRNA <- t(t(norm_miRNA_counts) / colSums(norm_miRNA_counts)) * 1e6

# Log2 transformation (the pseudocount of 1 avoids log2(0))
log2_cpm_RNA <- log2(cpm_RNA + 1)
log2_cpm_miRNA <- log2(cpm_miRNA + 1)

I have looked through many posts and found the TPM code for RNA-seq data. However, I could not find a specific one for miRNA. I would appreciate any comments on the code if it has any issues.
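
For reference, the quoted methods text names TMM (trimmed mean of M-values, a between-sample scaling method), which is not the same thing as TPM. A minimal sketch of steps 2 to 4 with edgeR might look like the following. It assumes edgeR is installed and that `counts` is a features-by-samples matrix of raw read counts; the matrix here is simulated purely for illustration, and `prior.count = 1` is an assumed offset, not a value taken from the paper.

```r
# Sketch only: `counts` is simulated (hypothetical data, not TCGA).
# Substitute a features-by-samples matrix of raw read counts.
library(edgeR)

set.seed(1)
counts <- matrix(rnbinom(200 * 4, mu = 50, size = 2), nrow = 200,
                 dimnames = list(paste0("mir", 1:200), paste0("s", 1:4)))

y <- DGEList(counts = counts)            # wrap counts together with library sizes
y <- calcNormFactors(y, method = "TMM")  # step 2: TMM scaling factors per sample

# Steps 3 and 4: CPM computed on the TMM-adjusted library sizes, then log2.
# prior.count = 1 is an assumed offset that avoids taking log2(0).
log2_cpm <- cpm(y, log = TRUE, prior.count = 1)
```

The same three calls would apply to an RNA-seq count matrix; whether the authors used exactly these settings is not stated in the quoted text.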

TCGA normalization TPM miRNA • 363 views
5 weeks ago
dsull ★ 6.2k

TPM is irrelevant for miRNA-seq; CPM works fine.

TPM tries to adjust for length effects by dividing by gene length (e.g. so that a 30,000 bp transcript does not appear more abundant than a 200 bp transcript when, in reality, the two might be equally expressed).

Part of the reason for this length bias is that longer transcripts are fragmented more and have more priming sites. Neither of these issues is relevant for miRNAs.
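
A toy illustration of that length effect, in base R. The counts and lengths are made-up numbers chosen so that both transcripts have the same reads per base, and `to_cpm` and `to_tpm` are hypothetical helper names, not functions from any package.

```r
# Hypothetical numbers: two transcripts with identical reads per base (0.1).
counts <- matrix(c(3000, 20, 6000, 40), nrow = 2,
                 dimnames = list(c("long_tx", "short_tx"), c("s1", "s2")))
lengths_bp <- c(long_tx = 30000, short_tx = 200)

# CPM: scale each sample (column) so that it sums to one million.
to_cpm <- function(m) t(t(m) / colSums(m)) * 1e6
# TPM: divide each row by its transcript length first, then scale per sample.
to_tpm <- function(m, len) to_cpm(m / len)

cpm_vals <- to_cpm(counts)              # long_tx dominates each sample
tpm_vals <- to_tpm(counts, lengths_bp)  # both transcripts get the same value
```

With CPM the long transcript takes almost all of each sample's total, while with TPM the two transcripts come out equal, which is the correction TPM is designed to make.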


Yes, I found similar suggestions in other posts on Biostars. So do you think the authors did something wrong, and that I should use only CPM?


When everything is the same size, correcting for size is pointless. It doesn't change the numbers much at all.
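
This can be checked with a small base R sketch. It assumes, for illustration only, that every feature has exactly the same length (22 nt, roughly the size of a mature miRNA); the counts are simulated and `to_cpm` is a hypothetical helper name.

```r
# Simulated counts for 5 features in 3 samples (illustrative only).
set.seed(42)
counts <- matrix(rpois(5 * 3, lambda = 100), nrow = 5,
                 dimnames = list(paste0("mir", 1:5), paste0("s", 1:3)))
len <- rep(22, 5)  # assumption: every feature is exactly 22 nt long

# CPM: scale each sample (column) so that it sums to one million.
to_cpm <- function(m) t(t(m) / colSums(m)) * 1e6

cpm_vals <- to_cpm(counts)
tpm_vals <- to_cpm(counts / len)  # TPM: length-normalize first, then scale

# The constant length cancels out in the per-sample scaling.
same <- isTRUE(all.equal(cpm_vals, tpm_vals))
```

When the lengths are identical the two measures coincide; with real miRNAs, whose lengths differ by only a few nucleotides, the values would differ only slightly.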

