Question: Difference between CPM and RPM for RNA-seq reada quantification
2
gravatar for v82masae
2.3 years ago by
v82masae140
v82masae140 wrote:

Hello everyone, I have a dataset of differentially expressed miRNAs.

I aligned my microRNA reads (10 millions per sample aprox) against reference genome, obtaining an average of 7.5 million aligned read per sample, and used FeatureCounts from RSubRead to count them asigning to reference miRNA GTF in miRBase, obtaining an average of 60% of asignment. Then, I used DESeq2 to perform differential expression analysis.

As said in Mullokandov et al. (2013), functional miRNAs should only be considered when expressed above 100-1000 RPM on average. I calculated CPM from FeatureCounts counts file and obtained average expression per group (2 groups contast) and expression per sample.

I would like to know if there is any difference between the concept of RPM and CPM, as I think the are quite the same, but not sure about that. I assumed miRNAs with a cpm expression below 100 to be spurious and noise, but I would like to be sure about that.

Any suggestions and clarifications?

Thanks

cpm rna-seq mirnas rpm • 4.6k views
ADD COMMENTlink modified 2.3 years ago by mforde841.2k • written 2.3 years ago by v82masae140

RPM=CPM=FPM (single end reads).

CPM cutoff of 100 sounds like a lot. But I have limited experience in miRNA sequence analysis. Here is one post for CPM cutoff for miRNA data.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Satyajeet Khare1.5k

I understand what that post says, but this is not refereing to biologically active miRNAs but to initially threshold filtering for eliminating reads with zeros and very small counts.

100-1000 or more RPM (or CPM but I am not sure XD) threshold is considered for withing the cell miRNA capability to drive a significant response in mRNA targeting and repressing gene expression and not just slight and spurious reactions. It is a matter of selecting really interesting DE miRNAs for discussion rather than for statistical analyses, in which I already applied such pre-filtering anyway.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by v82masae140

Following sentence was not clear

I assumed miRNAs with a cpm expression below 100 to be spurious and noise

Gave an impression that you are looking for cutoff for expression and not function.

ADD REPLYlink written 2.2 years ago by Satyajeet Khare1.5k

well, I assumed less than 100 CPMs to be a very low expression level so that the miRNA could not drive a significant function in repressing gene expression. If you have differential expression but with single or very scarce miRNA molecules, the effect is not expected to be such a thing to take into consideration. Maybe for profiling in cancer or so could be useful, but not for infering metabolic changes.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by v82masae140

No CPM, aka TPM, and F(R)PKM are not the same!! I think it is time for some StatQuest. You may also want to google for blogs and videos of Lior Pachter, one of the leading RNA-seq analysis guys on that matter.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by ATpoint26k

CPM is not TPM...

Question is the difference between counts per million and reads per million.

ADD REPLYlink written 2.3 years ago by mforde841.2k

I'm not asking for RPKM, TPM or FPKM sorry...

ADD REPLYlink written 2.3 years ago by v82masae140

Hi v82masae,

I have few queries regarding your miRNA requirement.

  • You might have used Illumina sequencing platform to sequence miRNA. After adapter trimming what was the length of the small RNA you kept?
  • Did you compare your sequenced miRNA against miRBase database?
  • What tools did you use to generate differentially expressed miRNAs?
ADD REPLYlink written 2.3 years ago by bioinforesearchquestions270
1

II did miRNA sequencing yes, with HiSeq 2500 Illumina, single-end.

I used Cutadapt for trimming adapters with an acceptance window of 15-30 nt (quite big, as reported windows are normally 18-25, but I wanted to retain big fragments for contrasting against circRNAs and precursor forms). I tried a smaller window and the output did not differ quite much, just a few thousand reads if not less.

After trimming, I aligned the reads against reference genome with about 80% mapping, and used FeatureCounts for counting reads in BAM files with an average 60% of successful assignment to GTF miRNA annotation file downloaded from miRBase v21 yes.

For differential expression I used DESeq2 and EdgeR, both giving quite similar results and kept DESeq2 just for department papers homogeneity as we are using DESeq2 for the majority of other papers from same project.

ADD REPLYlink written 2.3 years ago by v82masae140

Thanks, V82masae.

Based on this paper, I used the mirna_soft tool (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4138881/).

Even we sequenced using Illumina, 51 bp single-end. After trimming, I retained the length between 18-35nt. Then I compared the sequenced reads against the miRBase v21 database. For Bovine, I extracted bovine specific miRNA from mature.fa file from miRBase v21.

  • Did you do alignment with your species miRNAs or reference genome?

  • Did you use bowtie2 aligner?

I got the output file with miRNA id, sequence, count of unique matches and count of multi-matches. For one of the samples, total trimmed reads- 4,679,140. Unique matches to the bovine miRNA sequences (from miRBase) was 800,000 matches. unmatched reads - 3,796,387

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinforesearchquestions270

I did alignment from trimmed reads to reference genome yes. As my model organism (pig) had a very poorly annotated miRNAome, I also aligned my reads agains human and mouse genome and did all the pipeline to detect miRNAs differentially expressed that were not described in the miRNA GTD annotation in pigs, considering that within phylogenetically close species, miRNA sequences are commonly conserved.

I tried some de novo miRNA finding algorithms like mirDeep2, but the output did not convince me, too noisy. I opted for finding manually some DE non-annotated miRNAs in human and mouse and looking for their positions in pig assembly, then doing folding to see if secondary structure was equivalent to expected miRNA. Found 12 new miRNAs that way, 7 of them were in fact new miRNAs as they were included in the recently new pig assembly that was released one month ago, and the other 5 were compatible with fragments of circRNAs, but considered as miRNAs in human and mouse, so maybe some fragmentation of circRNA could be involved.

I used bowtie1, it is better suited for small reads than bowtie2, without considering any mismatches in seed (not taking into consideration ismoMIRs), maximum of 20 multimappings and --best --strata mode with one best alignment per read reported.

ADD REPLYlink written 2.2 years ago by v82masae140

Hi V82masae,

I am working on both bovine and pig.

What do you think about this workflow, from miRBase I extracted bovine and pig related miRNAs from mature.fa (miRBase) file. Using miRNA_soft tool as per the research article I provided in the previous comment, I counted the reads mapped to bovine and pig miRNA sequences extracted from mature.fa. I got an output file with the following details:

miRNA id, miRNA sequence, Unique matches and Multi-matches.

I got similar output for all the samples. Then again using miRNA_soft tool, I compared the counts generated in the previous step to generate differential miRNA expression profiling.

I am a naive person in miRNA sequencing. Do you recommend this workflow?

What is miRNA GTD?

Can you share the pig reference genome genbank id or URL?

Why didn't you compare your sequenced reads with miRBase database instead you compared it with the pig, mouse and human reference genomes?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by bioinforesearchquestions270
0
gravatar for mforde84
2.3 years ago by
mforde841.2k
mforde841.2k wrote:
reads per million  = counts per feature / (total reads / 1000000)
counts per million = (counts per feature / total reads) * 1000000
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by mforde841.2k

Ain't these equations same, just different representations?

ADD REPLYlink written 2.3 years ago by Satyajeet Khare1.5k

Watch the StatQuest video I linked. You will see that the difference in the order of the operations is quiet remarkable on the result.

ADD REPLYlink written 2.3 years ago by ATpoint26k

Order of operations are different for TPM and RPKM. But the question is what is the difference between CPM and RPM. I think there is none and the equations above also do not appear different from each other.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Satyajeet Khare1.5k

So your claim is that...

counts per feature / (total reads / 1000000) == (counts per feature / total reads) * 1000000

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by mforde841.2k

Well, for me those two equations are the same. Let's say I have a total of 5 million reads per sample i, and 1000 counts per miRNA j, so as you wrote:

RPM = 1000 / (5000000 / 1000000) = 1000 / 5 = 200 CPM = (1000 / 5000000) * 1000000 = 0.0002 * 1000000 = 200

I used the RPM formula you posted for calculations, but contrasting it with cpm function in EdgeR package, the output is the same, so that's why I got confused about terminology. I initially tried counting reads in fastq files (like those are reads, not counts, and then looking for sequences matching the miRNAs of interest, but then I realised that this would make the alignment and counting procedure spurious, so that all statistics were based on counting by assigning to miRNA locations, not read numbers itself. Soooo, I just kept CPM considering as the same of RPM, but I am not sure about that, and that's why I asked...

ADD REPLYlink written 2.3 years ago by v82masae140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 737 users visited in the last hour