Question: Calculating fpkm from raw counts
gravatar for vinayjrao
2.7 years ago by
inStem, India
vinayjrao200 wrote:


I have a file consisting of raw counts. For my analysis, I want to normalize the counts by fpkm. The question is, how can I calculate fpkm from raw counts?

Also, upon a little reading I found that DESeq2 could do i, but it requires the file to be in DESeq format. Could someone guide me on what the format should look like?

Thanks in advance.

Edit: Thank you for pointing it out. Values of raw counts changed! (I had mistakenly copied the rpkm values)

P.S. Below is an example of my raw counts file -

gene      sample 1      sample 2     sample 3

A1BG    7589      8458      7945

A1CF    513      718      748

A2M     7648      8998      9778

rna-seq fpkm R • 6.2k views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by vinayjrao200
gravatar for caggtaagtat
2.7 years ago by
caggtaagtat1.4k wrote:

Hi vina,

This is a quiet helpful article about normalized counts.

According to it, FPKM are calcualted like this:

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM (or FPKM):

1)Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.

2)Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)

3)Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

However it is advised to use TPM instead. I think it depends on what you are planning to do.

Nevertheless, it seems to me like your data is already somewhat normalized, since your raw gene counts are not integer numbers.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by caggtaagtat1.4k

Thanks for the reply. I have made an edit in the sample table (corrected a mistake). I have gone through the article, but it just mentions that the difference between rpkm and fpkm is that rpkm is for single-end data, while fpkm is for paired-end data.

Also, under what condition is TPM advisable?

I want to normalize the gene expression and classify the samples (1098 samples) into different molecular subtypes for breast cancer, and then study the expression of certain genes across the subtypes to check whether they have a conserved pattern of expression across subtypes.

ADD REPLYlink written 2.7 years ago by vinayjrao200

Ok, so for manually comparing expression of the same genes within the same organism, I would always use TPM. However, I never did this manually, but always with the DESeq2 package of R since it does the normalization for you and comes with various additonal features.

Therefore, I would definitively give DESeq2 a try!

You mentioned, that your data is not in the right format. I think you could change that by transforming your column "gene" to rownames in R like that:

rownames(dataframe) <- dataframe$gene
dataframe$gene <- NULL

Now you have a legit count matrix wich you can use to create a DESeq dataset with the DESeq2 function DESeqDataSetFromMatrix. However you will need an additional dataframe first, which gives DESeq information about your samples, its called coldata in this case.

The very basic structur of a coldata object can be created somehow like this:

coldata <- data.frame(sample_names = c("sample1", "sample2", "sample3"), molecular_breastcancer_subtype = c("Subtype_A", "Subtype_B", "Subtype_A"))

rownames(coldata) <- coldata$sample_names
coldata$sample_names <- NULL

Be aware, that the order of your sampels in the coldata object matches the order of your samples in the countmatrix. Now you use the count matrix and the coldata dataframe to generate a DESeq2 object

dds <- DESeqDataSetFromMatrix(countData = dataframe,
                              colData = coldata,
                              design = ~ molecular_breastcancer_subtype)

You could also additonally consider multiple factors, than just "molecular_breastcancer_subtype" in you "design" parameter, if you entered the information in the coldata dataframe.

There are a few nice workflows on the bioconductor page like this

ADD REPLYlink written 2.7 years ago by caggtaagtat1.4k

Thank you very much. I'll give it a try, but it certainly helped me understand better.

Edit: I considered your suggestion of taking TPM into account, and found that indeed it would be a better method to proceed with my analysis. Thank you.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by vinayjrao200
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2497 users visited in the last hour