Question: Calculating fpkm from raw counts
14 months ago, vinayjrao wrote:


I have a file consisting of raw counts. For my analysis, I want to normalize the counts by fpkm. The question is, how can I calculate fpkm from raw counts?

Also, after a little reading I found that DESeq2 could do it, but it requires the file to be in DESeq format. Could someone guide me on what the format should look like?

Thanks in advance.

Edit: Thank you for pointing it out. Values of raw counts changed! (I had mistakenly copied the rpkm values)

P.S. Below is an example of my raw counts file -

gene    sample 1    sample 2    sample 3
A1BG    7589        8458        7945
A1CF    513         718         748
A2M     7648        8998        9778

rna-seq fpkm R • 2.5k views
14 months ago, caggtaagtat wrote:

Hi vina,

This is quite a helpful article about normalized counts.

According to it, FPKM is calculated like this:

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM (or FPKM):

1) Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.

2) Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).

3) Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
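The three steps above can be sketched in base R roughly as follows. This is only a minimal illustration: the counts are the three genes from the question's example table, and the gene lengths are made-up placeholders (real lengths would have to come from your annotation). In a real data set the column sums would be taken over all genes, not just three.

```r
# Raw counts from the question's example table (genes x samples).
counts <- matrix(
  c(7589, 8458, 7945,
     513,  718,  748,
    7648, 8998, 9778),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A1BG", "A1CF", "A2M"),
                  c("sample1", "sample2", "sample3"))
)

# ASSUMPTION: placeholder gene lengths in base pairs, NOT real annotation values.
gene_length_bp <- c(A1BG = 2000, A1CF = 4000, A2M = 5000)

# Step 1: "per million" scaling factor for each sample (total reads / 1e6).
per_million <- colSums(counts) / 1e6

# Step 2: divide each column by its scaling factor to get reads per million (RPM).
rpm <- sweep(counts, 2, per_million, "/")

# Step 3: divide each row by the gene length in kilobases to get RPKM/FPKM.
fpkm <- sweep(rpm, 1, gene_length_bp / 1000, "/")
```

The result `fpkm` has the same shape as `counts`, with genes in rows and samples in columns.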

However, it is advised to use TPM instead. I think it depends on what you are planning to do.
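For comparison, TPM uses the same two normalizations in the opposite order: gene length first, sequencing depth second. A rough base R sketch, again using the counts from the question's table and made-up placeholder gene lengths, could look like this:

```r
# Raw counts from the question's example table (genes x samples).
counts <- matrix(
  c(7589, 8458, 7945,
     513,  718,  748,
    7648, 8998, 9778),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A1BG", "A1CF", "A2M"),
                  c("sample1", "sample2", "sample3"))
)

# ASSUMPTION: placeholder gene lengths in base pairs, NOT real annotation values.
gene_length_bp <- c(A1BG = 2000, A1CF = 4000, A2M = 5000)

# Step 1: divide each row by the gene length in kilobases (reads per kilobase, RPK).
rpk <- sweep(counts, 1, gene_length_bp / 1000, "/")

# Step 2: "per million" scaling factor per sample, computed from the RPK values.
per_million <- colSums(rpk) / 1e6

# Step 3: divide each column by its scaling factor to get TPM.
tpm <- sweep(rpk, 2, per_million, "/")

# Every sample (column) of a TPM matrix sums to one million.
colSums(tpm)
```

Because every column sums to the same total, TPM values are easier to compare between samples than RPKM/FPKM values.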

Nevertheless, it seems to me like your data is already somewhat normalized, since your raw gene counts are not integer numbers.


Thanks for the reply. I have made an edit in the sample table (corrected a mistake). I have gone through the article, but it just mentions that the difference between rpkm and fpkm is that rpkm is for single-end data, while fpkm is for paired-end data.

Also, under what condition is TPM advisable?

I want to normalize the gene expression and classify the samples (1098 samples) into different molecular subtypes for breast cancer, and then study the expression of certain genes across the subtypes to check whether they have a conserved pattern of expression across subtypes.

Reply written 14 months ago by vinayjrao

Ok, so for manually comparing expression of the same genes within the same organism, I would always use TPM. However, I have never done this manually; I have always used the DESeq2 package in R, since it does the normalization for you and comes with various additional features.

Therefore, I would definitely give DESeq2 a try!

You mentioned that your data is not in the right format. I think you could change that by turning your column "gene" into row names in R, like this:

rownames(dataframe) <- dataframe$gene
dataframe$gene <- NULL

Now you have a legitimate count matrix which you can use to create a DESeq dataset with the DESeq2 function DESeqDataSetFromMatrix. However, you will first need an additional data frame that gives DESeq2 information about your samples; it is called coldata in this case.

The very basic structure of a coldata object can be created roughly like this:

coldata <- data.frame(sample_names = c("sample1", "sample2", "sample3"), molecular_breastcancer_subtype = c("Subtype_A", "Subtype_B", "Subtype_A"))

rownames(coldata) <- coldata$sample_names
coldata$sample_names <- NULL

Make sure that the order of your samples in the coldata object matches the order of the sample columns in the count matrix. Now you can use the count matrix and the coldata data frame to generate a DESeq2 object:
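A quick way to check and, if needed, fix the sample order before building the dataset is sketched below in base R. The toy objects reuse the names from this answer; the coldata rows are deliberately shuffled here for illustration, and the subtype labels are just example values.

```r
# Toy count table with the gene column already moved to the row names.
dataframe <- data.frame(
  sample1 = c(7589, 513, 7648),
  sample2 = c(8458, 718, 8998),
  sample3 = c(7945, 748, 9778),
  row.names = c("A1BG", "A1CF", "A2M")
)

# Sample table whose rows are in a different order than the count columns.
coldata <- data.frame(
  molecular_breastcancer_subtype = c("Subtype_B", "Subtype_A", "Subtype_A"),
  row.names = c("sample2", "sample1", "sample3")
)

# Every sample must be present in both objects.
stopifnot(setequal(rownames(coldata), colnames(dataframe)))

# Reorder the coldata rows so they follow the columns of the count table.
coldata <- coldata[colnames(dataframe), , drop = FALSE]

# TRUE when the order matches.
all(rownames(coldata) == colnames(dataframe))
```

`drop = FALSE` keeps coldata a data frame even though it has only one column.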

# Load DESeq2 (a Bioconductor package), which provides DESeqDataSetFromMatrix
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = dataframe,
                              colData = coldata,
                              design = ~ molecular_breastcancer_subtype)

You could also consider additional factors besides "molecular_breastcancer_subtype" in your "design" parameter, provided you entered that information in the coldata data frame.

There are a few nice workflows on the Bioconductor page, like this one.
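As a small illustration of such a multi-factor setup, the sketch below adds a second column to coldata and builds a matching design formula in base R. The "batch" column and its values are purely hypothetical, and three samples are far too few for a real analysis; the formula would then be passed as the design argument of DESeqDataSetFromMatrix.

```r
# Sample table with a second, HYPOTHETICAL factor called "batch".
coldata <- data.frame(
  molecular_breastcancer_subtype = factor(c("Subtype_A", "Subtype_B", "Subtype_A")),
  batch = factor(c("batch1", "batch1", "batch2")),
  row.names = c("sample1", "sample2", "sample3")
)

# Design formula with two factors; the variable of interest goes last.
design_formula <- ~ batch + molecular_breastcancer_subtype

# Every variable used in the formula must be a column of coldata.
all(all.vars(design_formula) %in% colnames(coldata))
```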

Reply written 14 months ago by caggtaagtat

Thank you very much. I'll give it a try, but it certainly helped me understand better.

Edit: I considered your suggestion of taking TPM into account, and found that indeed it would be a better method to proceed with my analysis. Thank you.

Reply modified 14 months ago • written 14 months ago by vinayjrao