Question: Calculating FPKM and TPM by hand from htseq-count output?
10 days ago by
Joel Wallenius10 wrote:


I am counting reads with htseq-count, and wasted some hours trying to find existing software that would calculate FPKM and/or TPM from that output, so I wrote a script myself.

There is just one question mark - should the denominator (the sum of reads within the sample) be the sum of the reads that htseq-count successfully mapped to a feature, or the sum of reads in the input bam file?

And, if you happen to know, is it terrible to set the effective length to 1 when the calculation would otherwise give a negative value?
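For concreteness, the clamping described here can be sketched as follows. This is only an illustration: the script itself is not shown, so the definition used (feature length minus mean fragment length plus one) and the names `effective_length`, `mean_fragment_bp` and `floor` are assumptions, not the actual code.

```python
def effective_length(length_bp, mean_fragment_bp, floor=1.0):
    """Effective length with a lower bound (hypothetical helper).

    Assumes the common definition: feature length - mean fragment length + 1.
    Features shorter than the mean fragment would come out at zero or below,
    so the value is clamped to `floor`.
    """
    eff = length_bp - mean_fragment_bp + 1
    return max(eff, floor)


long_gene = effective_length(1000, 200)   # ordinary case, no clamping
short_gene = effective_length(150, 200)   # would be negative, so clamped
```

Note that a clamped feature gets a very small denominator, so even a handful of reads on it yields a large length-normalized value.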

Big thanks in advance!

Quick ref:

rna-seq tpm fpkm htseq • 122 views
ADD COMMENTlink written 10 days ago by Joel Wallenius10

It is commonly the number of mapped reads, since only these are relevant. Imagine 50% of the reads in your library were contaminants and so did not reflect gene expression. Using the sum of all reads as the denominator would then underestimate the true expression by roughly 50%.
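A small worked example of that effect, with made-up numbers (the gene, its length and the library sizes are all hypothetical), using the standard FPKM formula of count × 10^9 / (length in bp × library total):

```python
def fpkm_single(count, length_bp, library_total):
    """FPKM for one gene: fragments per kilobase per million fragments."""
    return count * 1e9 / (length_bp * library_total)


count = 1000                 # reads assigned to the gene (made up)
length_bp = 2000             # gene length in bp (made up)
assigned_reads = 10_000_000  # reads that reflect expression
all_reads = 20_000_000       # same library including 50% contaminants

using_assigned = fpkm_single(count, length_bp, assigned_reads)
using_all = fpkm_single(count, length_bp, all_reads)
```

With the contaminants included in the denominator, the value is half of what it is with the assigned reads alone.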

ADD REPLYlink written 10 days ago by ATpoint15k

Do you mean 'mapped' as in mapped to the genome, or mapped to an exon in my gtf file? The difference between these two numbers is quite large. Several million reads!

ADD REPLYlink modified 10 days ago • written 10 days ago by Joel Wallenius10

Typically, mapped means assigned to the exome. Commonly one calculates it from a count matrix, where the denominator is the column sum, which represents the reads that were mapped and successfully assigned to features. What do you need the FPKMs for? I hope not differential expression?
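A minimal sketch of that calculation, assuming the default two-column htseq-count output (feature ID, tab, count) and a separately supplied table of gene lengths in bp. The function names and the example data are hypothetical. The special summary lines that htseq-count appends (`__no_feature`, `__ambiguous`, and so on) are skipped, so the denominator is the sum of assigned reads only:

```python
import io


def read_htseq_counts(handle):
    """Parse two-column htseq-count output into {feature: count}."""
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        feature, value = fields[0], fields[1]
        if feature.startswith("__"):
            # Summary lines (__no_feature, __ambiguous, ...) are not genes.
            continue
        counts[feature] = int(value)
    return counts


def fpkm(counts, lengths_bp):
    """FPKM with the sum of assigned counts as the library size."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no assigned reads in this sample")
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """TPM: length-normalize first, then scale so the sample sums to 1e6."""
    rates = {g: c / lengths_bp[g] for g, c in counts.items()}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("no assigned reads in this sample")
    return {g: r * 1e6 / scale for g, r in rates.items()}


# Made-up example input standing in for one htseq-count output file.
example = io.StringIO("geneA\t100\ngeneB\t300\n__no_feature\t50\n__ambiguous\t5\n")
counts = read_htseq_counts(example)
lengths = {"geneA": 1000, "geneB": 2000}  # hypothetical gene lengths in bp

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

Because TPM is rescaled within the sample, its values always sum to one million, whereas the FPKM sum varies from sample to sample.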

ADD REPLYlink modified 10 days ago • written 10 days ago by ATpoint15k

OK thanks!!

I've wondered about the FPKMs myself since TPMs seem better (though it seems not even TPMs are wholly 'liked' by the community), but the others in my research group said FPKM is the standard measure in our field (leukemia), so I just rolled with it. Somehow the libraries are supposed to be prepared in such a way that we can do inter-sample comparisons even without e.g. house-keeping genes. I'm new here though, so I can't tell you any details.

ADD REPLYlink written 9 days ago by Joel Wallenius10

Neither of these measures is a proper normalization technique for inter-sample comparison. Check the biostatistics literature comparing normalization techniques: per-million methods regularly fail or perform poorly. You should use a proper framework like edgeR or DESeq2 for normalization and differential expression.

ADD REPLYlink written 9 days ago by ATpoint15k

Many things one should do, yes. PI wants FPKM, I oblige.

Thanks anyway!

ADD REPLYlink written 9 days ago by Joel Wallenius10