Question

Analysis of GEO dataset normalized by FPKM

5.6 years ago

GSAENZDEPIP ▴ 30

Good morning,

I'd like to perform differential expression analysis on some RNA-seq samples from the GEO database (GSE99987) and identify genes that differ significantly between conditions. However, the tables available on GEO contain FPKM-normalized values rather than raw counts. According to the authors, this normalization was done with Cuffdiff (v2.2.1).

So my question is: should I use the FPKM-normalized values for differential expression analysis without applying any further normalization (such as TMM or DESeq size factors)?

P.S.: I am confused because I have always read that FPKM normalization is meant for comparing genes within the same sample, whereas normalizations such as TMM or DESeq size factors are meant for comparing gene counts between samples (conditions).

Thank you in advance, Goren

RNA-Seq • 3.1k views

score 3 · Answer 1 · 2018-09-16

5.6 years ago

ATpoint 82k

FPKM is generally considered inferior to other normalization methods for comparisons between samples, because it corrects for sequencing depth and gene length but not for differences in library composition. If you want to use tools like DESeq2 or edgeR, you'll need raw counts, so you will probably have to download the raw data and quantify them yourself. I suggest a tool like Salmon or kallisto for transcript-level quantification, then tximport to aggregate the estimates to the gene level, followed by differential analysis with DESeq2 or a similar framework. You can get the raw data from the ENA, following my tutorial.
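To make the FPKM point concrete, here is a toy sketch with invented numbers (the gene names, lengths and counts are assumptions for illustration, not data from GSE99987). Genes A and B are assumed to be biologically unchanged, while gene C is switched on only in sample 2 and takes half of that library's reads. Both libraries have the same depth, yet FPKM makes A and B look two-fold down in sample 2.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Invented toy table: gene  length_bp  count_sample1  count_sample2
# Both libraries contain 1000 fragments in total.
counts='A 1000 500 250
B 1000 500 250
C 1000 0 500'

# FPKM = count * 1e9 / (gene length in bp * total fragments in the library)
fpkm_table=$(printf '%s\n' "$counts" | awk '
    { gene[NR] = $1; len[NR] = $2; c1[NR] = $3; c2[NR] = $4
      n1 += $3; n2 += $4 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s %.0f %.0f\n", gene[i],
               c1[i] * 1e9 / (len[i] * n1),
               c2[i] * 1e9 / (len[i] * n2)
    }')

echo "gene FPKM_sample1 FPKM_sample2"
echo "$fpkm_table"
```

The apparent drop in A and B comes purely from C competing for reads. Between-sample methods such as TMM or DESeq2 size factors are designed to compensate for this kind of composition effect, and they need the raw counts to do so.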

Okay, I will start from the raw data. I didn't know that you could download RNA-seq experiments from the ENA. Thank you!

5.6 years ago by GSAENZDEPIP ▴ 30

One last question: the raw data for this project have 3-4 runs per sample. How should I deal with that? I have always worked with one .fastq file per sample. Is there a tutorial for this situation?

Thanks!

5.6 years ago by GSAENZDEPIP ▴ 30

In the simplest case, you can combine them prior to quantification with cat in1.fq.gz in2.fq.gz (...) > in_comb.fq.gz and then proceed as usual. If these are technical replicates, i.e. the same library sequenced on different lanes, you will probably be fine. Alternatively, you can process the runs independently and then do a principal component analysis to check whether the lane replicates of each sample cluster together, as a quality check. The DESeq2 manual has a section on PCA and its input requirements (variance-stabilized counts). If this looks OK, you can simply sum the counts of the runs belonging to each sample.
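As a rough sketch of the concatenation approach for several samples at once: the loop below assumes a hypothetical <sample>_<run>.fq.gz naming scheme and creates tiny toy files so that it runs on its own; with real data you would drop the toy-data lines and adapt the names to your downloads.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy data standing in for real downloads: two samples with two runs each.
# The <sample>_<run>.fq.gz naming is hypothetical; adapt it to your files.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$workdir/sampleA_run1.fq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip > "$workdir/sampleA_run2.fq.gz"
printf '@r3\nGGCC\n+\nIIII\n' | gzip > "$workdir/sampleB_run1.fq.gz"
printf '@r4\nCATG\n+\nIIII\n' | gzip > "$workdir/sampleB_run2.fq.gz"

# Gzip streams can be concatenated directly; the result is a valid gzip file,
# so there is no need to decompress and recompress.
for sample in sampleA sampleB; do
    cat "$workdir/${sample}"_run*.fq.gz > "$workdir/${sample}_comb.fq.gz"
done

# Sanity check: each combined file should hold the reads of both runs
# (FASTQ uses four lines per read).
for sample in sampleA sampleB; do
    n_lines=$(gzip -dc "$workdir/${sample}_comb.fq.gz" | wc -l)
    echo "$sample: $((n_lines / 4)) reads"
done
```

For paired-end data, concatenate the R1 files and the R2 files separately and list the runs in the same order in both commands, so that mates stay in sync.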

5.6 years ago by ATpoint 82k