Question: Analysis of GEO dataset normalized by FPKM
2.4 years ago by GSAENZDEPIP

Good morning,

I'd like to perform differential expression analysis on some RNA-seq samples from the GEO database (GSE99987) and obtain the genes that differ significantly between conditions. However, the count tables available on GEO contain FPKM-normalized values. According to the authors, this normalization was done with Cuffdiff (v2.2.1).

So my question is: should I use the FPKM-normalized values for differential expression analysis without applying any other normalization (such as TMM or DESeq size factors)?

P.S.: I am confused because I've always read that FPKM normalization is meant for comparing gene expression within the same sample, whereas normalizations such as TMM or DESeq size factors are meant for comparing gene counts between different samples (conditions).

Thank you in advance, Goren

rna-seq • 1.7k views
modified 2.4 years ago • written 2.4 years ago by GSAENZDEPIP
2.4 years ago ATpoint wrote:

FPKM is considered inferior to other normalization methods for comparisons between samples, and count-based tools like DESeq2 or edgeR need raw counts, not FPKM values. You will probably have to download the raw data and quantify them yourself. I suggest you use a tool like Salmon or kallisto for transcript-level quantification, then tximport to aggregate the estimates to the gene level, followed by differential analysis with DESeq2 or a similar framework. You can get the raw data from the ENA, following my tutorial.
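To illustrate why FPKM values cannot stand in for raw counts, here is a minimal sketch using the textbook definition FPKM = fragments * 10^9 / (gene length in bp * total fragments). The gene names, lengths and counts are made up for illustration, and Cuffdiff's own estimates involve more than this formula, so treat it as a toy example only.

```python
def fpkm(counts, lengths_bp):
    """Textbook FPKM: fragments * 1e9 / (gene length in bp * total fragments)."""
    total = sum(counts.values())
    return {gene: n * 1e9 / (lengths_bp[gene] * total) for gene, n in counts.items()}


# Hypothetical gene lengths in base pairs.
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

# Two hypothetical libraries with the same composition,
# but the second one is sequenced 100 times deeper.
shallow = {"geneA": 10, "geneB": 20, "geneC": 70}
deep = {gene: n * 100 for gene, n in shallow.items()}

fpkm_shallow = fpkm(shallow, lengths)
fpkm_deep = fpkm(deep, lengths)

for gene in lengths:
    print(gene, shallow[gene], deep[gene], fpkm_shallow[gene], fpkm_deep[gene])
```

Both libraries give the same FPKM for every gene, although one value rests on 10 fragments and the other on 1000. Count-based models use exactly that count magnitude to estimate variance, and it cannot be recovered from the FPKM table alone.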

written 2.4 years ago by ATpoint

Okay, I will do it from the raw data. I didn't know that you could download RNA-seq experiments from ENA. Thank you!

written 2.4 years ago by GSAENZDEPIP

One last question. The raw data of this project has 3-4 runs per sample. How should I deal with that? I have always worked with one .fastq file per sample. Is there any tutorial for this situation?


written 2.4 years ago by GSAENZDEPIP

In the simplest case, you can combine them prior to quantification with cat in1.fq.gz in2.fq.gz (...) > in_comb.fq.gz and then proceed as usual. If these are technical replicates, i.e. the same library sequenced on different lanes, you will probably be fine. Alternatively, you can process the runs independently and then do a principal component analysis to check whether the lane replicates of each sample cluster together. This would be a quality check. There is a section in the DESeq2 manual about PCA and its input requirements (variance-stabilized counts). If this looks OK, you could simply sum up the counts of the runs that belong to the same sample.
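As a rough sketch of that last step, summing per-run raw counts into per-sample counts could look like the following. The run names, sample names and counts are hypothetical, and the helper `sum_runs_per_sample` is only an illustration, not part of any package. In R, DESeq2 provides `collapseReplicates` for the same purpose.

```python
from collections import defaultdict


def sum_runs_per_sample(run_counts, run_to_sample):
    """Add up raw gene counts of all runs that belong to the same sample."""
    sample_counts = defaultdict(lambda: defaultdict(int))
    for run, counts in run_counts.items():
        sample = run_to_sample[run]
        for gene, n in counts.items():
            sample_counts[sample][gene] += n
    # Convert the nested defaultdicts to plain dicts for easier handling.
    return {sample: dict(genes) for sample, genes in sample_counts.items()}


# Hypothetical raw counts, one table per sequencing run.
run_counts = {
    "run1": {"geneA": 10, "geneB": 0},
    "run2": {"geneA": 12, "geneB": 3},
    "run3": {"geneA": 7, "geneB": 1},
}

# Hypothetical mapping of runs to biological samples.
run_to_sample = {"run1": "sampleX", "run2": "sampleX", "run3": "sampleY"}

merged = sum_runs_per_sample(run_counts, run_to_sample)
print(merged)
```

This only makes sense for raw counts from technical replicates of the same library. Normalized values such as FPKM should not be summed this way.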

modified 2.4 years ago • written 2.4 years ago by ATpoint