Question: How should I normalize RNA-seq data to draw a gene expression time-series curve?
4.7 years ago by
jfhuang.dg30 wrote:


I am processing paired-end RNA-seq data from an insect developmental series.

Eight time points were selected for sampling; at each time point, several whole insects were pooled to extract total mRNA. As a result, I have eight RNA-seq data sets, one for each of the eight developmental stages:

Egg -> 24 hour -> 48 hour -> 72 hour -> 96 hour -> Prepupa -> Pupa -> Adult 

I want to plot gene expression across these time points, like this (figure not shown):

But the question is: how should I normalize my data across samples to make the expression values comparable? I used three different methods to calculate expression values, but the results confuse me a lot. I will post the gene expression correlation plots to illustrate.

(1) Map the RNA-seq data to the genome using TopHat, then use Cuffdiff to generate FPKM values (since Cuffdiff normalizes data across samples). This gives very weak correlation across samples.

(2) Map the data to the genome using TopHat, then generate RPKM values from the SAM output (using a tool from the Sandberg lab). This gives the following figure:

(3) Use RSEM to calculate a TPM value for each gene, and plot the result.


The results of (1) are very different from those of (2) and (3). Methods (2) and (3) give similar correlations between samples, but something looks wrong. For example, the first sample (Egg) and the last sample (Adult) have totally different phenotypes, yet their gene expression correlation seems too high (>0.9).

Could anyone comment on my results? Do the expression values need to be normalized across samples, either after the RPKM/TPM calculation or before it (especially for (2) and (3))? If so, which method should I use?


rna-seq • 5.8k views
modified 4.7 years ago by Amitm1.9k • written 4.7 years ago by jfhuang.dg30

Sorry for the watermarks on the pictures; I just could not find a good place to post my figures.

written 4.7 years ago by jfhuang.dg30

Did you try a logarithmic transformation before calculating the correlations?
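
To illustrate why the transformation can matter, here is a minimal Python sketch (standard library only). The sample names and TPM values are made up for illustration, and `pearson` and `log2p1` are hypothetical helper names. On the raw scale a single highly expressed gene can dominate the Pearson correlation, while log2(x + 1) gives lower-expressed genes more weight.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def log2p1(values):
    """log2(x + 1); the pseudocount of 1 keeps zero values finite."""
    return [math.log2(v + 1.0) for v in values]

# Hypothetical TPM values for six genes in two samples (not real data).
# The first gene is highly expressed in both; the others differ a lot.
egg = [5000.0, 3.0, 40.0, 0.0, 12.0, 150.0]
adult = [4800.0, 30.0, 4.0, 9.0, 1.0, 20.0]

r_raw = pearson(egg, adult)
r_log = pearson(log2p1(egg), log2p1(adult))
print(f"raw: {r_raw:.3f}  log2(x+1): {r_log:.3f}")
```

With these toy numbers the raw correlation is close to 1 even though most genes disagree, and the log-scale correlation is clearly lower. A rank-based (Spearman) correlation would be another way to reduce the influence of a few very abundant genes.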

written 4.7 years ago by kks0

I tried taking log10 of my RSEM results, and the plot looks different.

But I don't know whether that means my data are usable or not.

modified 4.7 years ago • written 4.7 years ago by jfhuang.dg30
4.7 years ago by
Amitm1.9k wrote:


This may seem very primitive, but after log-transforming, make a boxplot and look at the data. I find this very intuitive, and worth doing before puzzling over concordance or correlation.
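
As a rough sketch of that check without any plotting library, the numbers a boxplot would draw can be computed per sample in Python. The sample names and values below are invented, and `box_summary` is a hypothetical helper; the quartile convention (`method="inclusive"`) is just one of several in use.

```python
import math
import statistics

def box_summary(values):
    """Return (min, Q1, median, Q3, max) of log2(value + 1)."""
    logged = sorted(math.log2(v + 1.0) for v in values)
    q1, med, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return (logged[0], q1, med, q3, logged[-1])

# Hypothetical per-sample expression values (not real data).
samples = {
    "Egg": [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0],
    "Adult": [0.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0],
}

summaries = {name: box_summary(vals) for name, vals in samples.items()}
for name, s in summaries.items():
    print(name, [round(x, 2) for x in s])
```

If the medians and quartiles differ a lot between samples, that suggests the samples are not yet on a comparable scale.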

Also, some normalization is important before drawing conclusions. Bioconductor/R has good packages such as DESeq2 for RNA-seq data, though I have seen them perform less well with non-replicate data like yours.
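
For a feel of what between-sample normalization does, here is a minimal Python sketch of the median-of-ratios idea that DESeq-style size factors are based on. This is not DESeq2's API, only an illustration of the principle; `size_factors` is a hypothetical function, the counts are made up, and the method is meant for raw read counts rather than FPKM/RPKM/TPM values.

```python
import math
import statistics

def size_factors(counts):
    """counts: dict sample -> list of raw read counts (same gene order)."""
    names = list(counts)
    n_genes = len(counts[names[0]])
    # Use only genes with a nonzero count in every sample.
    usable = [i for i in range(n_genes) if all(counts[s][i] > 0 for s in names)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo = {i: statistics.mean(math.log(counts[s][i]) for s in names) for i in usable}
    # Size factor = median ratio of a sample's counts to the geometric means.
    return {
        s: math.exp(statistics.median(math.log(counts[s][i]) - log_geo[i] for i in usable))
        for s in names
    }

# Hypothetical counts: sample "B" was sequenced about twice as deeply as "A".
counts = {"A": [10, 20, 0, 40], "B": [20, 40, 5, 80]}
sf = size_factors(counts)
normalized = {s: [c / sf[s] for c in counts[s]] for s in counts}
print(sf)
print(normalized)
```

Dividing each sample's counts by its size factor puts the samples on a common scale. The approach assumes that most genes do not change between samples, which may be a strong assumption across very different developmental stages.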

I can suggest some basic steps that I use to reduce variation. RNA-seq data have a large number of genes/transcripts with zero or near-zero values (caveat: my experience is with data from human tissues and cell lines only).

1) Calculate the average expression value for each gene across all samples.

2) Sort this vector and apply an (arbitrary) threshold. The idea is to remove the genes that have only basal expression across all samples. If you make a density plot of this vector in R, it should be clear where a cutoff could be made.

3) After this, with the selected gene list, calculate the standard deviation of each gene across samples and again make a selection, this time for the highly variable ones.

Ultimately you would be left with a gene set that is non-basal and "responding" to your biological question.

Then calculate correlations or perform clustering to discover groups of genes.
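
The filtering steps above can be sketched in Python as follows. The gene names, values, and both cutoffs are hypothetical; in practice the cutoffs would be chosen by looking at the density plots, and `filter_genes` is just an illustrative helper name.

```python
import statistics

def filter_genes(expr, mean_cutoff, sd_cutoff):
    """expr: dict gene -> list of log-scale values, one per sample."""
    # Steps 1-2: drop genes whose average expression is at basal level.
    expressed = {g: v for g, v in expr.items() if statistics.mean(v) >= mean_cutoff}
    # Step 3: keep genes that vary strongly across the time points.
    return {g: v for g, v in expressed.items() if statistics.stdev(v) >= sd_cutoff}

# Hypothetical log-scale expression for three genes at eight time points.
expr = {
    "geneA": [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0],  # basal everywhere
    "geneB": [6.0, 6.1, 5.9, 6.0, 6.1, 6.0, 5.9, 6.0],  # expressed but flat
    "geneC": [1.0, 2.0, 4.0, 7.0, 9.0, 8.0, 5.0, 2.0],  # expressed and variable
}

selected = filter_genes(expr, mean_cutoff=1.0, sd_cutoff=1.0)
print(sorted(selected))
```

Only the genes that pass both filters would then go into the correlation or clustering step.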

With non-replicate RNA-seq data, I am not aware of any rigorous statistical methods. The above is my take on making the best of the available data.

written 4.7 years ago by Amitm1.9k
4.7 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.3k wrote:

Take a look at this RPub to see if it is useful for you.

written 4.7 years ago by Antonio R. Franco4.3k

An excellent tutorial! The figures inspire me a lot!

Array data usually use log2 values. The difference is that array samples from different time points are usually held on a single array, so they are normalized together.

I am not sure whether RNA-seq data (FPKM/RPKM/TPM values) need further processing to be normalized among samples. The results from my data (RPKM and TPM) look unreasonable, as the correlation between the first and last time points is quite high. But with the Cuffdiff results, the samples look quite different.

If the RPKM/TPM results need no further treatment (normalization among samples), then there is some problem with my data (they may be wrong).

The data are from my collaborator, so I must make sure that the data processing is right, in order to decide whether the problem arose in my data processing or before sequencing (in the design of the wet-lab experiment).

modified 4.7 years ago • written 4.7 years ago by jfhuang.dg30