Should I use "assigned" reads or total reads (assigned + unassigned) for the RPKM value?
7.0 years ago

Dear all,

I'm recalculating RPKM values for RNA-Seq data with the featureCounts function in Rsubread, and I'd like to know whether I should use only the "Assigned" reads or the total reads, including "Unassigned_Ambiguity", "Unassigned_MultiMapping", etc. (see below), in the RPKM formula. Searching forums and Mortazavi et al. (2008), I have only found that "N is the total number of mappable reads in the experiment". Could anybody please help in this regard?

RPKM = N/(L*T)


where:

N: number of reads assigned to a gene
L: length of the gene (kb)
T: total number of mapped reads (millions), the term my question is about

                           T_reesei_F24.1_GGCTAC_L008_R1_001.cleanreads.fastq.gz_tophat2.F24h.1_accepted_hits.bam
Assigned                   32270962
Unassigned_Ambiguity       6896
Unassigned_MultiMapping    116803
Unassigned_NoFeatures      10751746
Unassigned_Unmapped        0
Unassigned_MappingQuality  0
Unassigned_FragementLength 0
Unassigned_Chimera         0
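
To make the formula concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself uses R). The function name `rpkm` and the example gene are hypothetical. It assumes T is the library size expressed in millions of reads, as in the usual RPKM definition, and it uses the "Assigned" count as T purely as one possible choice, not as the settled answer.

```python
def rpkm(gene_reads, gene_length_bp, library_reads):
    """Reads per kilobase of gene per million reads in the library."""
    length_kb = gene_length_bp / 1_000             # L in the formula
    library_millions = library_reads / 1_000_000   # T in the formula
    return gene_reads / (length_kb * library_millions)


# Hypothetical gene: 500 reads on a 2,000 bp gene.
# The library size is the "Assigned" count from the summary above.
assigned = 32_270_962
example = rpkm(gene_reads=500, gene_length_bp=2_000, library_reads=assigned)
print(round(example, 3))
```

Swapping a different total in for `library_reads` rescales every gene in the sample by the same factor, so the choice matters mainly when comparing between samples.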


Well, RPKM is calculated with respect to the total number of mapped reads.

If you are working with reads that map uniquely to the genome, then you should consider only the Assigned reads.

7.0 years ago

If you include things like Unassigned_Ambiguity in the numerator, then include them in the denominator too. Likewise with Unassigned_MultiMapping. Unassigned_NoFeatures could be left as part of the denominator, though I wouldn't include it, since that will bias things by sample quality. Having said that, I wouldn't calculate RPKMs at all, since in my opinion they shouldn't be used, but perhaps you have a good reason.
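
As a rough illustration of how much the denominator choice matters, here is a small Python sketch using the counts from the featureCounts summary in the question. The helper `library_size` and the three groupings are my own naming, not anything from Rsubread. It only varies the denominator; the per-gene numerators would have to be counted under the matching rules for the result to be consistent.

```python
# Counts copied from the featureCounts summary in the question
# (categories with a count of 0 are omitted).
summary = {
    "Assigned": 32_270_962,
    "Unassigned_Ambiguity": 6_896,
    "Unassigned_MultiMapping": 116_803,
    "Unassigned_NoFeatures": 10_751_746,
}


def library_size(summary, categories):
    """Sum the read counts of the chosen summary categories."""
    return sum(summary[name] for name in categories)


assigned_only = library_size(summary, ["Assigned"])
with_ambiguous = library_size(
    summary, ["Assigned", "Unassigned_Ambiguity", "Unassigned_MultiMapping"]
)
all_mapped = library_size(summary, list(summary))

# RPKM is inversely proportional to the denominator, so each option
# rescales every gene's RPKM by assigned_only / size.
for label, size in [
    ("assigned only", assigned_only),
    ("plus ambiguous", with_ambiguous),
    ("all mapped", all_mapped),
]:
    print(f"{label:>15}: {size:>11,d} reads, scale {assigned_only / size:.4f}")
```

For this sample, adding the ambiguous and multi-mapping reads barely changes the total, while adding Unassigned_NoFeatures increases it by roughly a third, which is why that category is the one that ties the result to sample quality.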

The statOmique consortium tested different normalization methods, and RPKM was the worst one: http://bib.oxfordjournals.org/content/14/6/671.long

This really can't be emphasized enough. RPKMs really are a bad solution in search of a problem.

I entirely agree, Devon.

But the problem is: if we want to compare gene expression levels, e.g. across cell lines, what should we rely on other than RPKM?

I think RPKM is a bad solution for smaller transcripts (<500 bp).

You'd be better off with counts. The really tricky comparison is between organisms, but that's largely an unsolved problem (last I looked, at least).

To compare between organisms, would it be better to consider only those reads that map uniquely to both genomes,

then divide the read counts in features by the total number of mapped reads,

then quantile-normalize them?

Would the data then be ready for comparison?
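
The steps proposed above could be sketched roughly as follows, in plain Python. The function names and the toy counts are hypothetical. The sketch assumes every sample has a value for the same genes in the same order, which is itself a strong assumption when comparing organisms, and it breaks ties by original position rather than averaging them.

```python
def counts_per_million(counts, total_mapped):
    """Scale raw counts by the total number of mapped reads, per million."""
    return [c / total_mapped * 1_000_000 for c in counts]


def quantile_normalize(samples):
    """Give every sample the same distribution: the mean of each rank.

    samples: list of samples, each a list of values for the same genes.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean value at each rank position across all samples.
    rank_means = [
        sum(s[i] for s in sorted_samples) / len(samples) for i in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy example: two samples, three shared genes (made-up numbers).
sample_a = counts_per_million([50, 20, 30], total_mapped=100)
sample_b = counts_per_million([40, 10, 60], total_mapped=110)
normalized_a, normalized_b = quantile_normalize([sample_a, sample_b])
print(normalized_a)
print(normalized_b)
```

After this step both samples contain exactly the same set of values, only assigned to different genes, so any real global difference between the samples is removed by construction.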

The issue is more how things might be meaningfully normalized when the gene sets aren't even the same. But anyway that's off topic to this post.

Yes, certainly. I was just curious.

Thanks

