Calculating fpkm from raw counts
Entering edit mode
4.9 years ago
vinayjrao ▴ 220


I have a file consisting of raw counts. For my analysis, I want to normalize the counts by fpkm. The question is, how can I calculate fpkm from raw counts?

Also, upon a little reading I found that DESeq2 could do i, but it requires the file to be in DESeq format. Could someone guide me on what the format should look like?

Thanks in advance.

Edit: Thank you for pointing it out. Values of raw counts changed! (I had mistakenly copied the rpkm values)

P.S. Below is an example of my raw counts file -

gene      sample 1      sample 2     sample 3

A1BG    7589      8458      7945

A1CF    513      718      748

A2M     7648      8998      9778

RNA-Seq R fpkm • 15k views
Entering edit mode
24 months ago
Ahmed Alhendi ▴ 210

You can use countToFPKM package. This package provides an easy to use function to convert the read count matrix into FPKM matrix; following the equation in enter image description here

The fpkm() function requires three inputs to return FPKM as numeric matrix normalized by library size and feature length:

  • counts A numeric matrix of raw feature counts.
  • featureLength A numeric vector with feature lengths that can be obtained using
  • meanFragmentLength A numeric vector with mean fragment lengths,
    which can be calculate with Picard using CollectInsertSizeMetrics.


Entering edit mode

Can I use MEAN_INSERT_SIZE from picard as the meanFragmentLength directly ? Should I add the length of the adapter sequence that I used for trimming to the MEAN_INSERT_SIZE ? Please clarify on this.

Entering edit mode
4.9 years ago
caggtaagtat ★ 1.6k

Hi vina,

This is a quiet helpful article about normalized counts.

According to it, FPKM are calcualted like this:

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM (or FPKM):

1)Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.

2)Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)

3)Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

However it is advised to use TPM instead. I think it depends on what you are planning to do.

Nevertheless, it seems to me like your data is already somewhat normalized, since your raw gene counts are not integer numbers.

Entering edit mode

Thanks for the reply. I have made an edit in the sample table (corrected a mistake). I have gone through the article, but it just mentions that the difference between rpkm and fpkm is that rpkm is for single-end data, while fpkm is for paired-end data.

Also, under what condition is TPM advisable?

I want to normalize the gene expression and classify the samples (1098 samples) into different molecular subtypes for breast cancer, and then study the expression of certain genes across the subtypes to check whether they have a conserved pattern of expression across subtypes.

Entering edit mode

Ok, so for manually comparing expression of the same genes within the same organism, I would always use TPM. However, I never did this manually, but always with the DESeq2 package of R since it does the normalization for you and comes with various additonal features.

Therefore, I would definitively give DESeq2 a try!

You mentioned, that your data is not in the right format. I think you could change that by transforming your column "gene" to rownames in R like that:

rownames(dataframe) <- dataframe$gene
dataframe$gene <- NULL

Now you have a legit count matrix wich you can use to create a DESeq dataset with the DESeq2 function DESeqDataSetFromMatrix. However you will need an additional dataframe first, which gives DESeq information about your samples, its called coldata in this case.

The very basic structur of a coldata object can be created somehow like this:

coldata <- data.frame(sample_names = c("sample1", "sample2", "sample3"), molecular_breastcancer_subtype = c("Subtype_A", "Subtype_B", "Subtype_A"))

rownames(coldata) <- coldata$sample_names
coldata$sample_names <- NULL

Be aware, that the order of your sampels in the coldata object matches the order of your samples in the countmatrix. Now you use the count matrix and the coldata dataframe to generate a DESeq2 object

dds <- DESeqDataSetFromMatrix(countData = dataframe,
                              colData = coldata,
                              design = ~ molecular_breastcancer_subtype)

You could also additonally consider multiple factors, than just "molecular_breastcancer_subtype" in you "design" parameter, if you entered the information in the coldata dataframe.

There are a few nice workflows on the bioconductor page like this

Entering edit mode

Thank you very much. I'll give it a try, but it certainly helped me understand better.

Edit: I considered your suggestion of taking TPM into account, and found that indeed it would be a better method to proceed with my analysis. Thank you.


Login before adding your answer.

Traffic: 2188 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6