Calculating FPKM and TPM by hand from htseq-count output?
2.5 years ago

Hello!

I am counting reads with htseq-count, and after spending some hours trying to find existing software that would calculate FPKM and/or TPM from that output, I wrote a script myself.

There is just one question mark - should the denominator (the sum of reads within the sample) be the sum of the reads that htseq-count successfully mapped to a feature, or the sum of reads in the input bam file?

And, if you happen to know, is it terrible if the effective length is set to 1 whenever it would otherwise come out negative?

Big thanks in advance!

Quick ref: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

RNA-Seq rna-seq htseq FPKM TPM • 2.9k views

It is commonly the number of mapped reads, as only these are relevant. Imagine you had 50% contaminant reads in your library, so that 50% of the reads do not reflect your gene expression. Using the sum of all reads as the denominator would then underestimate the true expression by roughly 50%.
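To put numbers on the contamination example, here is a minimal sketch. All values are made up for illustration, and `fpkm` is a hypothetical helper using the usual definition (count × 10^9 / (feature length in bp × library size)).

```python
def fpkm(count, length_bp, library_size):
    """FPKM = count * 1e9 / (feature length in bp * library size)."""
    return count * 1e9 / (length_bp * library_size)

# Hypothetical library: 10 million reads in the BAM, half of them contaminant,
# so only 5 million reads actually come from the transcriptome of interest.
total_reads = 10_000_000
assigned_reads = 5_000_000

# Hypothetical gene: 500 reads on a 2,000 bp feature.
gene_count, gene_length = 500, 2_000

fpkm_assigned = fpkm(gene_count, gene_length, assigned_reads)
fpkm_total = fpkm(gene_count, gene_length, total_reads)

# Ratio of the two estimates: how much the all-reads denominator shrinks the value.
ratio = fpkm_total / fpkm_assigned
print(fpkm_assigned, fpkm_total, ratio)
```

With these numbers the all-reads denominator gives half the value obtained with the assigned-reads denominator, which is the 50% underestimate described above.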


Do you mean 'mapped' as in mapped to the genome, or mapped to an exon in my gtf file? The difference between these two numbers is quite large. Several million reads!


Typically mapped = assigned to the exome. Commonly one calculates it from a count matrix, where the denominator is the column sum for that sample, i.e. the reads that were mapped and successfully assigned to features. What do you need the FPKMs for? I hope not differential expression?
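A rough sketch of that calculation, in case it helps to compare against your own script. It assumes the usual two-column, tab-separated htseq-count output, where the summary rows at the end (`__no_feature`, `__ambiguous`, and so on) start with `__` and are left out of the denominator. The function names and the `lengths` dictionary (feature lengths in bp, which htseq-count does not provide) are hypothetical.

```python
import io

def read_htseq_counts(handle):
    """Parse htseq-count output, skipping the '__' summary rows."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous: not assigned to any gene
        counts[feature] = int(value)
    return counts

def fpkm_and_tpm(counts, lengths):
    """Return (fpkm, tpm) dicts; the denominator is the sum of assigned reads."""
    assigned = sum(counts.values())  # column sum of the count matrix
    fpkm = {g: c * 1e9 / (lengths[g] * assigned) for g, c in counts.items()}
    # TPM: length-normalise first, then scale so the values sum to one million.
    rate = {g: c / lengths[g] for g, c in counts.items()}
    rate_sum = sum(rate.values())
    tpm = {g: r * 1e6 / rate_sum for g, r in rate.items()}
    return fpkm, tpm

# Made-up example input and lengths, purely for illustration.
example = "geneA\t100\ngeneB\t300\n__no_feature\t600\n__ambiguous\t50\n"
lengths = {"geneA": 1000, "geneB": 3000}

counts = read_htseq_counts(io.StringIO(example))
fpkm, tpm = fpkm_and_tpm(counts, lengths)
print(fpkm, tpm)
```

Note that the sketch divides by zero if no reads were assigned at all, and that `lengths` could just as well hold effective lengths if you prefer those.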


OK thanks!!

I've wondered about the FPKMs myself, since TPMs seem better (though not even TPMs are wholly 'liked' by the community, it seems), but the others in my research group said FPKM is the standard measure in our field (leukemia), so I just rolled with it. Somehow the libraries are supposed to be prepared in such a way that we can do inter-sample comparisons even without e.g. house-keeping genes. I'm new here though, so I can't tell you any details.


Neither of these units is a proper normalization technique for inter-sample comparison. Check the biostats literature comparing normalization techniques. Per-million methods regularly fail or perform poorly. You should use a proper framework like edgeR or DESeq2 for normalization and differential expression.
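For intuition only, here is a minimal sketch of the median-of-ratios idea that underlies DESeq2's size factors. It is not the package's actual code, the function name and input layout are hypothetical, and for real analyses the packages themselves should be used.

```python
import math
from statistics import median

def size_factors(count_matrix):
    """count_matrix: dict mapping sample name -> list of counts (same gene order).

    For each gene, take the geometric mean across samples; a sample's size
    factor is the median of its count / geometric-mean ratios.
    """
    samples = list(count_matrix)
    n_genes = len(next(iter(count_matrix.values())))
    # Only genes with a non-zero count in every sample have a usable log ratio.
    usable = [i for i in range(n_genes)
              if all(count_matrix[s][i] > 0 for s in samples)]
    log_geo_mean = {
        i: sum(math.log(count_matrix[s][i]) for s in samples) / len(samples)
        for i in usable
    }
    return {
        s: math.exp(median(math.log(count_matrix[s][i]) - log_geo_mean[i]
                           for i in usable))
        for s in samples
    }

# Made-up counts: sample B is sequenced twice as deeply as sample A.
matrix = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(matrix)
print(factors)
```

Unlike a per-million scaling, the median makes the factor robust to a handful of very highly expressed genes dominating the library. The sketch raises an error if no gene has non-zero counts in every sample.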


Many things one should do, yes. PI wants FPKM, I oblige.

Thanks anyway!
