Calculating FPKM and TPM by hand from htseq-count output?

5.0 years ago

Joel Wallenius ▴ 210

Hello!

I am counting reads with htseq-count, and wasted some hours trying to find existing software that would calculate FPKM and/or TPM from that output, so I wrote a script myself.

There is just one open question: should the denominator (the sum of reads within the sample) be the sum of the reads that htseq-count successfully assigned to a feature, or the sum of reads in the input BAM file?

And, if you happen to know, is it terrible to set the effective length to 1 when it would otherwise come out negative?

Big thanks in advance!

Quick ref: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
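For context, a minimal sketch of the effective-length adjustment being asked about, assuming the usual definition (feature length minus mean fragment length, plus 1). The function name, the `floor` parameter and the example numbers are made up for illustration. Clamping at 1 is only the workaround described in the question, not a recommendation.

```python
def effective_length(length, mean_fragment_length, floor=1.0):
    """Effective length = length - mean fragment length + 1.

    Features shorter than the mean fragment length would get a value
    <= 0, so the result is clamped to `floor` (the workaround from the
    question; whether that is acceptable is exactly what is being asked).
    """
    return max(length - mean_fragment_length + 1, floor)


# A 1000 bp feature with a 200 bp mean fragment length is unaffected
# by the clamp; a 150 bp feature would go negative and gets clamped.
long_feature = effective_length(1000, 200)
short_feature = effective_length(150, 200)
```

Note that a clamped feature ends up with a very small denominator, so even a handful of reads gives it a very large FPKM/TPM value.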

RNA-Seq rna-seq htseq FPKM TPM • 5.4k views

ADD COMMENT • link updated 8 months ago by ATpoint 82k • written 5.0 years ago by Joel Wallenius ▴ 210

It is commonly the number of mapped reads, as only these are relevant. Imagine you had 50% contaminant reads in your library, so that half of the reads do not reflect your gene expression. Taking the sum of all reads as the denominator would then underestimate the true expression by roughly 50%.
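To put numbers on the contamination example, here is a small sketch using the standard FPKM formula (count × 10^9 / (length × total)). The gene, its length and the read totals are invented for illustration.

```python
def fpkm(count, length_bp, total_reads):
    """FPKM for one gene: count * 1e9 / (length in bp * library total)."""
    return count * 1e9 / (length_bp * total_reads)


gene_count = 1000          # reads assigned to a hypothetical gene
gene_length = 2000         # gene length in bp
assigned_reads = 10_000_000    # reads that reflect gene expression
all_reads = 20_000_000         # same library including 50% contaminants

fpkm_assigned = fpkm(gene_count, gene_length, assigned_reads)
fpkm_all = fpkm(gene_count, gene_length, all_reads)

# Doubling the denominator halves the value for every gene.
ratio = fpkm_all / fpkm_assigned
```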

ADD REPLY • link 5.0 years ago by ATpoint 82k

Do you mean 'mapped' as in mapped to the genome, or mapped to an exon in my gtf file? The difference between these two numbers is quite large. Several million reads!

ADD REPLY • link 5.0 years ago by Joel Wallenius ▴ 210

Typically mapped = assigned to the exome. Commonly one calculates it from a count matrix, where the denominator is the column sum for that sample, i.e. the reads that were mapped and successfully assigned to features. What do you need the FPKMs for? I hope not differential expression?
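A rough sketch of that calculation for a single htseq-count output file, assuming the usual two-column, tab-separated format in which the summary counters (`__no_feature`, `__ambiguous`, and so on) start with a double underscore. The gene IDs, counts and the `lengths` dictionary are made up. In practice the lengths would come from your GTF, and how you define them (e.g. union of exons) is your own choice.

```python
import io


def read_htseq_counts(handle):
    """Parse htseq-count output, skipping the '__' summary counters."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, value = line.split("\t")
        if gene_id.startswith("__"):
            continue  # not a feature: unassigned/ambiguous/etc.
        counts[gene_id] = int(value)
    return counts


def fpkm(counts, lengths):
    """Denominator is the column sum of assigned counts."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths[g] * total) for g, c in counts.items()}


def tpm(counts, lengths):
    """Length-normalise first, then scale the rates to sum to 1e6."""
    rates = {g: c / lengths[g] for g, c in counts.items()}
    scale = sum(rates.values())
    return {g: r * 1e6 / scale for g, r in rates.items()}


# Hypothetical htseq-count output and gene lengths (bp).
example = io.StringIO(
    "geneA\t100\n"
    "geneB\t300\n"
    "geneC\t600\n"
    "__no_feature\t5000\n"
    "__ambiguous\t200\n"
)
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

counts = read_htseq_counts(example)
fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

Here `geneA` and `geneB` get the same value because they have the same count per base, and the TPM values sum to one million by construction.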

ADD REPLY • link 5.0 years ago by ATpoint 82k

OK thanks!!

I've wondered about the FPKMs myself, since TPMs seem better (though not even TPMs are wholly 'liked' by the community, it seems), but the others in my research group said FPKM is the standard measure in our field (leukemia), so I just rolled with it. Somehow the libraries are supposed to be prepared in such a way that we can do inter-sample comparisons even without e.g. housekeeping genes. I'm new here, though, so I can't tell you any details.

ADD REPLY • link 5.0 years ago by Joel Wallenius ▴ 210

Neither of these units is a proper normalization technique for inter-sample comparison. Check the biostats literature on comparisons of normalization techniques. Per-million methods regularly fail or perform poorly. You should use a proper framework like edgeR or DESeq2 for normalization and differential expression.
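A toy illustration of one way per-million scaling can mislead. In the made-up data below, genes A to D have identical counts in both samples and only gene E changes, yet the per-million values of A to D drop. The second function is a simplified reimplementation of the median-of-ratios idea behind DESeq2's size factors. It is not the DESeq2 API, and it assumes every gene has a non-zero count in every sample.

```python
import math
from statistics import median


def cpm(sample):
    """Counts per million: scale by the sample's own total."""
    total = sum(sample.values())
    return {g: c * 1e6 / total for g, c in sample.items()}


def median_of_ratios_size_factors(samples):
    """Simplified median-of-ratios size factors (illustrative only)."""
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    # Per-gene geometric mean across samples, computed on the log scale.
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in genes
    }
    return [
        median(s[g] / math.exp(log_geo_mean[g]) for g in genes)
        for s in samples
    ]


# Hypothetical counts: only gene E differs between the two samples.
sample_1 = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
sample_2 = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 1000}

cpm_1 = cpm(sample_1)
cpm_2 = cpm(sample_2)
size_factors = median_of_ratios_size_factors([sample_1, sample_2])
```

Gene A has the same raw count in both samples, but its per-million value falls from 200000 to about 71429 because gene E inflates the total of the second sample. The median-based size factors both come out at about 1, so the unchanged genes stay unchanged.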

ADD REPLY • link 5.0 years ago by ATpoint 82k

Many things one should do, yes. PI wants FPKM, I oblige.

Thanks anyway!

ADD REPLY • link 5.0 years ago by Joel Wallenius ▴ 210

Hi Joel Wallenius, I'm having this same issue. I'm specifically trying to convert htseq-count output to TPM. Would you be okay with sharing your script for this?

Thank you!

ADD REPLY • link 8 months ago by AHerik ▴ 20

Hi, do you still need help? The script would be in an old zip somewheres... might take a while to dig out!

ADD REPLY • link 8 months ago by Joel Wallenius ▴ 210

There are dozens of answers on Biostars on how to convert raw counts to TPM, for example: Raw counts to TPM in R

ADD REPLY • link 8 months ago by ATpoint 82k