Question

Calculating FPKM values from HT-counts

0

7.0 years ago

rbronste ▴ 420

Hi,

Interested in converting some raw HT-counts to FPKMs and wondering how everyone approaches this problem? I have a few ideas but was interested in the consensus view. Thank you!

Rob.

FPKM RPKM RNA-Seq • 2.8k views

ADD COMMENT • link updated 7.0 years ago by Devon Ryan 104k • written 7.0 years ago by rbronste ▴ 420

score 1 · Answer 1 · 2017-05-12

1

7.0 years ago

Devon Ryan 104k

It's best not to, but if you really must, a common approach is to take either the median transcript length or the length of the "union gene model" as the K in FPKM. Aside from that, it's counts / length (in kb) / library size (in millions of mapped fragments). Note that if you want to compare between samples, you should use normalized counts, since FPKMs made from raw counts are inappropriate for comparison between samples (among the reasons it's best not to bother with FPKMs).
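
As a rough illustration of the arithmetic above, here is a minimal R sketch. The gene names, counts, and lengths are made up, `counts_to_fpkm` is a hypothetical helper, and using the sum of the gene-assigned counts as the library size is an assumption (the total number of mapped fragments is another common choice).

```r
# Hypothetical helper: FPKM = counts / (length in kb) / (library size in millions).
# By default the library size is the sum of the supplied counts (an assumption).
counts_to_fpkm <- function(counts, len_bp, lib_size = sum(counts)) {
  counts / (len_bp / 1e3) / (lib_size / 1e6)
}

# Made-up per-gene counts and gene lengths in base pairs.
counts <- c(geneA = 500, geneB = 1500, geneC = 8000)
len_bp <- c(geneA = 1000, geneB = 3000, geneC = 2000)

fpkm <- counts_to_fpkm(counts, len_bp)
print(fpkm)
```

geneA and geneB end up with the same FPKM here because geneB has three times the counts but is also three times as long.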

ADD COMMENT • link 7.0 years ago by Devon Ryan 104k

0

Great, thanks. Is there any particular place where it's easiest to download a list of transcript lengths? I guess the UCSC Table Browser?

ADD REPLY • link 6.9 years ago by rbronste ▴ 420

1

I usually just calculate it from GTF files, but if you can get it from UCSC then all the easier :)

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

I usually do as well, just wondering if it was just somewhere in UCSC. I usually do the following in R:

    library(GenomicFeatures)
    txdb <- makeTxDbFromGFF("test.gtf", format="gtf")
    trans <- transcripts(txdb, columns=c("GENEID"))
    df <- data.frame(gene=trans$GENEID, len=width(trans))

I am wondering, though: given that HTSeq outputs per-gene counts irrespective of alternative transcripts, how best to pair the lengths from the R commands above with the HTSeq counts?

Thanks.
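
One possible way to do the pairing, sketched in base R under stated assumptions: collapse the transcript-level table to one length per gene (the median is used here, one of the options mentioned earlier in the thread), drop the special htseq-count summary rows whose names start with `__`, and look up each gene's length by ID. All gene names and numbers below are made up, and the inline data frames stand in for tables you would read from your own files.

```r
# Made-up transcript-level lengths: several transcripts per gene.
tx <- data.frame(
  gene = c("geneA", "geneA", "geneA", "geneB", "geneC", "geneC"),
  len  = c(900, 1000, 1400, 3000, 1500, 2500),
  stringsAsFactors = FALSE
)

# Made-up htseq-count output, including its special "__" summary rows.
ht <- data.frame(
  gene  = c("geneA", "geneB", "geneC", "__no_feature", "__ambiguous"),
  count = c(500, 1500, 8000, 1200, 300),
  stringsAsFactors = FALSE
)

# One length per gene: the median over that gene's transcripts.
gene_len <- aggregate(len ~ gene, data = tx, FUN = median)

# Drop the summary rows, then attach each gene's length by matching IDs.
ht <- ht[!startsWith(ht$gene, "__"), ]
ht$len <- gene_len$len[match(ht$gene, gene_len$gene)]

# Fail loudly if any counted gene has no length.
stopifnot(!anyNA(ht$len))
print(ht)
```

Matching by ID rather than by row order matters because the two tables are not guaranteed to list genes in the same order.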

ADD REPLY • link 6.9 years ago by rbronste ▴ 420

0

Hmm, won't that be the lengths of the transcripts on the genome (i.e., including introns), rather than their transcribed lengths? I have an old script that will produce the "union gene model" length from a GTF file, which I suspect will be a bit more reasonable. That would also match better with what htseq-count and featureCounts are doing.
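
To make the difference concrete, here is a minimal base-R sketch of a union gene model length on a made-up exon table: overlapping exons of a gene are merged and the merged block lengths are summed, so introns are not counted. This is an illustration only, not the script mentioned above; `union_length` and all coordinates are hypothetical.

```r
# Made-up exon table with 1-based, inclusive coordinates.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneA", "geneB"),
  start = c(100, 150, 400, 10),
  end   = c(200, 300, 500, 59),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total length of the union of a set of intervals.
union_length <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or adjacent exon: extend the current merged block.
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap (intron): close the current block and start a new one.
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

by_gene <- split(exons, exons$gene)

# Union gene model length: merged exonic bases only.
union_len <- sapply(by_gene, function(g) union_length(g$start, g$end))

# Genomic span for comparison: first exon start to last exon end, introns included.
span_len <- sapply(by_gene, function(g) max(g$end) - min(g$start) + 1)

print(rbind(union_len, span_len))
```

For geneA the union length is 302 bp while the genomic span is 401 bp. With a TxDb object, the same idea is commonly written as `sum(width(reduce(exonsBy(txdb, by="gene"))))`.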

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

I believe you're right. I'm giving the script you linked a shot as well. Just so I understand completely, what is the background FASTA you're using here? Also, in this instance, is the GC length being used as a proxy for transcribed length? Thanks for the help!

ADD REPLY • link 6.9 years ago by rbronste ▴ 420

0

You can remove the GC related stuff and the fasta file, you don't need that. If you look at the README file in that directory you'll note that this was really made for CQN, which needs GC content.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Hi Devon, could you please explain a little more why it is best not to convert raw RNA-Seq counts to FPKM values? You also said "if you want to compare between samples, that you should use normalized counts, since FPKMs made from raw counts are inappropriate for comparison between samples"; what does "normalized counts" mean there? I was thinking FPKMs are a kind of normalized count. Do you mean CPM, or, more generally, normalization for library size?

ADD REPLY • link 6.5 years ago by ebrudermanver ▴ 100

0

Regarding FPKMs, the reasons behind this have been repeated so often that I won't bother doing so again. Please simply search this site for them.

Regarding "normalized counts", please search for "RNAseq normalized counts" on Google. In essence, these are any counts resulting after correcting for library size in a robust manner (i.e., not FPKM or CPM).
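
One widely used example of such a robust correction is the median-of-ratios size factor used by DESeq2. The thread does not name a specific method, so treat that choice as an assumption; the sketch below is a base-R illustration on made-up counts, not the DESeq2 implementation.

```r
# Made-up count matrix: sample s2 was sequenced exactly twice as deeply as s1.
counts <- matrix(
  c(10,  20,
    100, 200,
    30,  60,
    50,  100),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2"))
)

# Per-gene log geometric mean across samples (the reference pseudo-sample).
log_geo_means <- rowMeans(log(counts))

# Genes with a zero count in any sample give -Inf and are excluded.
keep <- is.finite(log_geo_means)

# Size factor per sample: median ratio of its counts to the reference.
size_factors <- apply(counts, 2, function(x) {
  exp(median((log(x) - log_geo_means)[keep]))
})

# Normalized counts: divide each sample's column by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(norm_counts)
```

Because the median ratio is taken over genes, a handful of very highly expressed genes has little influence on the size factor, which is the sense in which this is more robust than dividing by the total count.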

ADD REPLY • link 6.5 years ago by Devon Ryan 104k