Question

Calculating FPKM after htseq-count

1

7.2 years ago

Fill ▴ 70

I just want to clarify one thing.

I need to calculate FPKM. I use this formula: Normalized = [(raw_read_count)(10^9)] / [(gene_length)(XXXX)],

XXXX = the count of all reads that are aligned to protein-coding genes in that alignment.

How should I calculate XXXX? Is it just the sum of all raw read counts after htseq-count (e.g. in R, XXXX <- sum(column_with_raw_read_counts))?

Thanks!

RNA-Seq FPKM htseq • 17k views

ADD COMMENT • link updated 5.1 years ago by Ahmed Alhendi ▴ 230 • written 7.2 years ago by Fill ▴ 70

1

Try this solution,

PERL solution:

https://github.com/santhilalsubhash/rpkm_rnaseq_count

R solution:

A: How to normalise read count per gene

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

Thanks, but the question is about how to get the total count of all reads. Is it just the sum of all counts? I ask because I am trying to reproduce the results on the GDC data portal and can't: their total count of all reads is smaller than mine (which I calculate with sum()).

ADD REPLY • link 7.2 years ago by Fill ▴ 70

0

Yes. It is the sum of all counts within each individual sample (the library size).
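A minimal Python sketch of that per-sample sum, offered as an illustration rather than something posted in the thread. It assumes the usual two-column htseq-count output (gene ID, tab, count) and skips the summary rows htseq-count appends, whose names start with `__` (for example `__no_feature` and `__ambiguous`). The function name `library_size` and the example rows are hypothetical.

```python
def library_size(count_lines):
    """Sum per-gene counts from htseq-count output lines (gene_id<TAB>count)."""
    total = 0
    for line in count_lines:
        line = line.strip()
        if not line:
            continue
        gene_id, count = line.split("\t")
        # htseq-count summary rows (e.g. __no_feature) are not gene counts.
        if gene_id.startswith("__"):
            continue
        total += int(count)
    return total


# Hypothetical sample: three genes followed by two summary rows.
example = [
    "ENSG_A\t10",
    "ENSG_B\t0",
    "ENSG_C\t25",
    "__no_feature\t100",
    "__ambiguous\t7",
]
print(library_size(example))
```

Whether the summary rows should be excluded depends on what the total is meant to represent; here they are dropped because they are not assigned to any gene.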

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

It's "fragments per kilobase transcript per million reads", so you should divide by million, not multiply.

That said, are you sure FPKM is what you need? I don't know your application, but for many purposes there are better normalisation methods.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

1

$\text{FPKM}_i = \dfrac{X_i}{ \left(\dfrac{\widetilde{l}_i}{10^3}\right) \left( \dfrac{N}{10^6} \right)} = \dfrac{X_i}{\widetilde{l}_i N} \cdot 10^9$

This is the formula with effective length; you can see why I multiply by 10^9 rather than divide.
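The equivalence of the two forms can be checked numerically. The sketch below is only an illustration with made-up numbers; the function names `fpkm_stepwise` and `fpkm_compact` are hypothetical.

```python
import math


def fpkm_stepwise(count, length_bp, total_reads):
    """Count divided by length in kilobases and by library size in millions."""
    per_kilobase = length_bp / 1e3
    per_million = total_reads / 1e6
    return count / (per_kilobase * per_million)


def fpkm_compact(count, length_bp, total_reads):
    """Same quantity with the two scale factors folded into 10^9."""
    return count * 1e9 / (length_bp * total_reads)


# Made-up numbers purely for illustration.
count, length_bp, total_reads = 500, 2000, 20_000_000
a = fpkm_stepwise(count, length_bp, total_reads)
b = fpkm_compact(count, length_bp, total_reads)
print(a, b, math.isclose(a, b))
```

Dividing by 10^3 and 10^6 in the denominator is the same as multiplying the whole fraction by 10^9, which is all the compact form does.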

I am trying to reproduce the results on the GDC data portal.

ADD REPLY • link 7.2 years ago by Fill ▴ 70

0

Ugh, you're totally right. Shame on me!

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Can you clarify the "duplicate" part? Unless you are using the exact versions of the software and genome build, that may not be realistically possible.

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

I am using exactly the same software versions and genome as TCGA, and everything is good: the TCGA results and my results from the mRNA pipeline are identical, except for the N value used to calculate FPKM (the count of all reads that are aligned to protein-coding genes in that alignment).

ADD REPLY • link 7.2 years ago by Fill ▴ 70

0

Is it possible that for paired-end reads (fragments) they divided the total reads by 2? (I know it is silly to think like that, but can you match your numbers divided by two with the TCGA total library sizes?)

OR

that they used raw_reads/2 per gene while calculating FPKM?

ADD REPLY • link 7.2 years ago by EagleEye 7.5k

0

Thanks for your guess, but it doesn't work that way. My N values differ by 0.03x - 0.08x (x = TCGA N values).

ADD REPLY • link 7.2 years ago by Fill ▴ 70

0

Aligners may produce non-deterministic output (unless they are able to accept a seed). Perhaps that is what is causing this difference.

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

5.1 years ago

Ahmed Alhendi ▴ 230

Try the countToFPKM package. It provides an easy-to-use function that converts a read count matrix into FPKM values normalised by library size and feature effective length. It implements the following equation:

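$\text{FPKM}_i = \dfrac{X_i}{\widetilde{l}_i N} \cdot 10^9$, where $X_i$ is the raw count for feature $i$, $\widetilde{l}_i$ is its effective length, and $N$ is the library size.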

The fpkm() function takes three inputs and returns FPKM as a numeric matrix normalized by library size and feature length:

counts: a numeric matrix of raw feature counts.
featureLength: a numeric vector of feature lengths, which can be obtained using biomaRt.
meanFragmentLength: a numeric vector of mean fragment lengths, which can be calculated with Picard using CollectInsertSizeMetrics.
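To make the role of the three inputs concrete, here is a rough single-sample sketch in Python. It is not the package's code. It assumes the effective length is the feature length minus the mean fragment length plus 1, which should be checked against the countToFPKM documentation. The function name `fpkm_effective_length` and all numbers are invented for illustration.

```python
def fpkm_effective_length(counts, feature_lengths, mean_fragment_length):
    """FPKM for one sample from raw counts, feature lengths (bp) and
    the sample's mean fragment length (bp)."""
    library_size = sum(counts)  # N: total counted fragments in the sample
    values = []
    for count, length in zip(counts, feature_lengths):
        # Assumed definition of effective length (see the note above).
        effective_length = length - mean_fragment_length + 1
        if effective_length <= 0:
            raise ValueError("feature shorter than the mean fragment length")
        values.append(count * 1e9 / (effective_length * library_size))
    return values


# Invented example: three features, mean fragment length of 201 bp.
counts = [100, 300, 600]
feature_lengths = [1200, 2200, 4200]
result = fpkm_effective_length(counts, feature_lengths, 201)
print(result)
```

The sketch handles one sample at a time; a count matrix would simply repeat the calculation per column with that column's own library size and mean fragment length.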

Also see https://github.com/AAlhendi1707/countToFPKM

ADD COMMENT • link 5.0 years ago by Ahmed Alhendi ▴ 230

score 6 · Accepted Answer · 2017-02-20

6

7.2 years ago

Fill ▴ 70

I got an answer from the GDC portal:

Download GTF files used in HTSeq analyses: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files (GDC.h38 GENCODE v22 GTF)

Extract only protein-coding gene IDs: awk -F'\t' '$3 == "gene"' gencode.v22.annotation.gtf | grep protein_coding | cut -f9 | cut -f2 -d '"' > ProteinCodingGeneList.txt

Use the resulting list to extract only the protein-coding values from the counts file: grep -Ff ProteinCodingGeneList.txt CountFile.txt > CountOnlyProt.txt

Sum the count values in CountOnlyProt.txt; that gives you the denominator value.

My problem was that I summed the reads for all genes, when I should have summed only those for protein-coding genes.
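The same filtering can be sketched in Python, as an illustration rather than the GDC's own procedure. It assumes GENCODE-style GTF lines where the ninth column begins with `gene_id "..."` and carries a `gene_type "protein_coding"` attribute, and a two-column htseq-count file. The function names and the toy records are hypothetical.

```python
def protein_coding_gene_ids(gtf_lines):
    """Collect gene_id values of 'gene' records annotated as protein_coding."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attributes = fields[8]
        if 'gene_type "protein_coding"' not in attributes:
            continue
        # First quoted value in the attribute column is the gene_id.
        ids.add(attributes.split('"')[1])
    return ids


def protein_coding_total(count_lines, pc_ids):
    """Sum counts only for gene IDs present in pc_ids."""
    total = 0
    for line in count_lines:
        gene_id, count = line.strip().split("\t")
        if gene_id in pc_ids:
            total += int(count)
    return total


# Toy records, not real annotation data.
gtf = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_A.1"; gene_type "protein_coding";',
    'chr1\tHAVANA\tgene\t2000\t2500\t.\t-\t.\tgene_id "ENSG_B.1"; gene_type "lincRNA";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_A.1"; gene_type "protein_coding";',
    'chr2\tHAVANA\tgene\t50\t5000\t.\t+\t.\tgene_id "ENSG_C.1"; gene_type "protein_coding";',
]
counts = ["ENSG_A.1\t10", "ENSG_B.1\t40", "ENSG_C.1\t25", "__no_feature\t100"]

pc_ids = protein_coding_gene_ids(gtf)
print(sorted(pc_ids), protein_coding_total(counts, pc_ids))
```

Unlike a fixed-string grep, the set lookup matches whole gene IDs only, so one ID that happens to be a prefix of another cannot pull in an extra row.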

P.S. thanks to GDC support team!

ADD COMMENT • link 7.2 years ago by Fill ▴ 70

0

Hi, I stumbled upon your post while trying to find out why I cannot reproduce the FPKM values provided by the GDC from the same raw count data. Following these steps does not make it any better. Just to be sure, you took the counts from the HTSeq files and the gene lengths from the GDC.h38 GENCODE v22 GTF, right? As for the N in the denominator, as explained above, it should be the sum of counts over all protein-coding genes...

ADD REPLY • link 6.5 years ago by CuriousGuy ▴ 90

0

You're right. N is the sum of all reads which align to protein-coding genes. Gene lengths and counts are for all genes.
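One way to read that: every gene gets an FPKM value, but the denominator N only sums protein-coding counts. A small Python sketch under that reading, with a hypothetical function name `fpkm_all_genes` and made-up counts and lengths:

```python
def fpkm_all_genes(counts, lengths, pc_ids):
    """FPKM for every gene, using only protein-coding counts for N.

    counts, lengths: dicts keyed by gene ID; pc_ids: set of protein-coding IDs.
    """
    n = sum(c for gene, c in counts.items() if gene in pc_ids)
    if n == 0:
        raise ValueError("no protein-coding counts to normalise by")
    return {gene: c * 1e9 / (lengths[gene] * n) for gene, c in counts.items()}


# Made-up example: gene B is non-coding, so it is excluded from N
# but still receives an FPKM value.
counts = {"A": 20, "B": 40, "C": 30}
lengths = {"A": 1000, "B": 2000, "C": 3000}
fpkm = fpkm_all_genes(counts, lengths, pc_ids={"A", "C"})
print(fpkm)
```

Here N is 50 (genes A and C), not 90, even though gene B still appears in the output.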

ADD REPLY • link 6.4 years ago by Fill ▴ 70

0

Does it make sense to also include genes that code for non-coding regulatory RNA? That is, to calculate N by summing counts over all exonic features?

ADD REPLY • link 5.0 years ago by gatollefson • 0