Question: Calculating FPKM and TPM by hand from htseq-count output?
Joel Wallenius (70), Sweden, wrote 7 months ago:

Hello!

I am counting reads with htseq-count, and wasted some hours trying to find existing software that would calculate FPKM and/or TPM from that output. I found none, so I wrote a script myself.

There is just one question mark - should the denominator (the sum of reads within the sample) be the sum of the reads that htseq-count successfully mapped to a feature, or the sum of reads in the input bam file?

And, if you happen to know, is it terrible to set the effective length to 1 whenever the calculated value would otherwise be negative?

Big thanks in advance!

Quick ref: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

Tags: rna-seq, tpm, fpkm, htseq
written 7 months ago by Joel Wallenius (70) • modified 6 months ago by Biostar
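For readers following the effective-length question, here is a minimal sketch of the floor described above. It assumes the common definition of effective length (feature length minus mean fragment length plus one). The function name and the `floor` parameter are hypothetical, chosen only for illustration. Whether a floor of 1 is appropriate is not settled in this thread.

```python
def effective_length(length, mean_fragment_length, floor=1.0):
    """Effective length of a feature, clamped to a minimum value.

    Assumption: effective length = length - mean fragment length + 1.
    `floor` is a hypothetical parameter; 1.0 mirrors the choice in the question.
    """
    eff = length - mean_fragment_length + 1
    # Features shorter than the mean fragment would come out zero or negative.
    return max(eff, floor)


# A 1000 bp feature with a 200 bp mean fragment length stays well above the floor.
long_feature = effective_length(1000, 200)

# A 150 bp feature would come out negative, so the floor applies.
short_feature = effective_length(150, 200)
```

Note that dividing a count by an effective length of 1 gives a very large per-kilobase value, so floored features may deserve a flag or a filter rather than being taken at face value.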

It is commonly the number of mapped reads, as only these are relevant. Imagine you had 50% contaminant reads in your library, so 50% of the reads do not reflect your gene expression. Dividing by the sum of all reads would then underestimate the true expression by roughly 50%.

written 7 months ago by ATpoint (26k)

Do you mean 'mapped' as in mapped to the genome, or mapped to an exon in my gtf file? The difference between these two numbers is quite large. Several million reads!

written 7 months ago by Joel Wallenius (70) • modified 7 months ago

Typically mapped = assigned to the exome. Commonly one calculates it from a count matrix, where the denominator is the column sum, i.e. the reads that were mapped and successfully assigned to features. What do you need the FPKMs for? I hope not differential expression?

written 7 months ago by ATpoint (26k) • modified 7 months ago
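The column-sum denominator described above can be sketched as follows. This is an illustration under stated assumptions, not the script from the question: the function names are hypothetical, the lengths are supplied by the caller (raw or effective feature lengths in bp), and the input is the two-column, tab-separated htseq-count output. htseq-count appends summary rows whose names start with `__` (such as `__no_feature` and `__ambiguous`), and the sketch drops them so that only reads assigned to a feature enter the denominator.

```python
def parse_htseq_counts(lines):
    """Read htseq-count output lines into {feature_id: count}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        feature_id, value = fields[0], fields[-1]
        # Skip the special counters (__no_feature, __ambiguous, ...).
        if feature_id.startswith("__"):
            continue
        counts[feature_id] = int(value)
    return counts


def fpkm(counts, lengths):
    """FPKM with the sum of assigned reads as the denominator."""
    total = sum(counts.values())
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {g: c * 1e9 / (lengths[g] * total) for g, c in counts.items()}


def tpm(counts, lengths):
    """TPM: length-normalized rates rescaled to sum to one million."""
    rates = {g: c / lengths[g] for g, c in counts.items()}
    rate_sum = sum(rates.values())
    return {g: r / rate_sum * 1e6 for g, r in rates.items()}


# Made-up example input; real usage would pass an open file instead of a list.
example_lines = ["geneA\t100", "geneB\t300", "__no_feature\t50", "__ambiguous\t5"]
example_lengths = {"geneA": 1000, "geneB": 2000}  # bp, hypothetical values
example_counts = parse_htseq_counts(example_lines)
example_fpkm = fpkm(example_counts, example_lengths)
example_tpm = tpm(example_counts, example_lengths)
```

The sketch does not guard against a sample with zero assigned reads or a feature missing from `lengths`; a real script would need both checks.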

OK thanks!!

I've wondered about the FPKMs myself since TPMs seem better (but not even TPMs are wholly 'liked' by the community, it seems), but the others in my research group said FPKM is the standard measure in our field (leukemia), so I just rolled with it. Somehow the libraries are supposed to be prepared in such a way that we can do inter-sample comparisons even without e.g. house-keeping genes. I'm new here though, so I can't tell you any details.

written 7 months ago by Joel Wallenius (70)

Neither of these units is a proper normalization technique for inter-sample comparison. Check the biostatistics literature comparing normalization techniques: per-million methods regularly fail or perform poorly. You should use a proper framework like edgeR or DESeq2 for normalization and differential expression.

written 7 months ago by ATpoint (26k)
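A toy illustration of why per-million scaling can mislead between samples, using made-up counts. The second function is a simplified sketch of the median-of-ratios idea behind DESeq2's size factors; it is not the DESeq2 implementation, and all names here are hypothetical. In the example only the fourth gene differs between samples, yet per-million scaling shifts the other three.

```python
import math
from statistics import median


def cpm(counts):
    """Counts per million: scale by the sample's own total."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def median_of_ratios_size_factors(samples):
    """Simplified median-of-ratios size factors.

    `samples` is a list of per-sample count lists in the same gene order.
    Genes with a zero count in any sample are skipped.
    """
    n_genes = len(samples[0])
    log_geo_means = []
    for g in range(n_genes):
        gene_counts = [s[g] for s in samples]
        if all(c > 0 for c in gene_counts):
            # Log of the geometric mean across samples for this gene.
            log_geo_means.append(sum(math.log(c) for c in gene_counts) / len(gene_counts))
        else:
            log_geo_means.append(None)
    factors = []
    for s in samples:
        log_ratios = [
            math.log(s[g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Made-up counts: genes 1 to 3 are unchanged, gene 4 is ten times higher in sample B.
sample_a = [100, 200, 300, 400]
sample_b = [100, 200, 300, 4000]

cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)
size_factors = median_of_ratios_size_factors([sample_a, sample_b])
```

Here the first gene has the same raw count in both samples, but its per-million value in sample A is 4.6 times that in sample B, because gene 4 inflates sample B's total. The median-of-ratios size factors come out equal for the two samples, so the three unchanged genes stay unchanged after scaling.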

Many things one should do, yes. PI wants FPKM, I oblige.

Thanks anyway!

written 7 months ago by Joel Wallenius (70)