Analysis of GEO dataset normalized by FPKM
5.6 years ago
GSAENZDEPIP ▴ 30

Good morning,

I'd like to perform differential expression analysis on some RNA-seq samples from the GEO database (GSE99987) and obtain the significant genes between different conditions. However, the count tables available on GEO contain FPKM-normalized counts. According to the authors, this normalization was done with Cuffdiff (v2.2.1).

So my question is: should I use the FPKM-normalized counts for differential expression analysis without applying any other normalization (such as TMM or DESeq size factors)?

P.S.: I am confused because I've always read that FPKM normalization is meant for comparing gene counts within the same sample, whereas normalizations such as TMM or DESeq size factors are meant for comparing gene counts between different conditions (samples).

Thank you in advance, Goren

RNA-Seq • 3.1k views
5.6 years ago
ATpoint 82k

FPKM is considered inferior to other normalization methods, and tools like DESeq2 or edgeR need raw counts. You will probably have to download the raw data and quantify it yourself. I suggest using a tool like Salmon or Kallisto for transcript-level quantification, then tximport to aggregate the counts to the gene level, followed by differential analysis with DESeq2 or a similar framework. You can get the raw data from the ENA, following my tutorial.
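To illustrate why FPKM values are hard to compare between samples, here is a minimal toy sketch in Python. The gene names, lengths and counts are made up for illustration, and `size_factors` is only a simplified version of the median-of-ratios idea behind DESeq-style normalization, not the DESeq2 implementation. Only geneC differs between the two hypothetical samples, yet the FPKM of geneA drops because FPKM divides by the sample's total fragment count.

```python
import math
from statistics import median

def fpkm(counts, lengths_bp):
    """FPKM = fragments * 1e9 / (gene length in bp * total fragments in the sample)."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}

def size_factors(samples):
    """Simplified median-of-ratios size factors (assumes all counts are > 0)."""
    genes = list(samples[0])
    # Per-gene geometric mean across samples, used as a pseudo-reference.
    geo = {g: math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
           for g in genes}
    # Per-sample median of the gene-wise ratios to that reference.
    return [median(s[g] / geo[g] for g in genes) for s in samples]

# Hypothetical toy data: only geneC changes between the two samples.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
sample1 = {"geneA": 100, "geneB": 200, "geneC": 100}
sample2 = {"geneA": 100, "geneB": 200, "geneC": 1700}

fpkm1 = fpkm(sample1, lengths_bp)
fpkm2 = fpkm(sample2, lengths_bp)
sf = size_factors([sample1, sample2])

print("geneA FPKM:", fpkm1["geneA"], "vs", fpkm2["geneA"])
print("size factors:", sf)
print("geneA normalized counts:", sample1["geneA"] / sf[0], "vs", sample2["geneA"] / sf[1])
```

In this toy case the size factors come out essentially equal, so geneA looks unchanged after count-based normalization, while its FPKM differs five-fold between the samples.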

Okay, I will do it from the raw data. I didn't know that you could download RNA-seq experiments from the ENA... Thank you!!

One last question. The raw data of this project has 3-4 runs per sample... how should I deal with that? I have always worked with one .fastq file per sample. Is there any tutorial for this situation?

Thanks!

In the simplest case, you can combine them prior to quantification with cat in1.fq.gz in2.fq.gz (...) > in_comb.fq.gz and then proceed as usual. If these are technical replicates, i.e. the same library sequenced on different lanes, you will probably be fine. Alternatively, you can quantify each run independently and then do a principal component analysis to check whether the lane replicates cluster together, as a quality check. The DESeq2 manual has a section on PCA and its input requirements (variance-stabilized counts). If this looks OK, you can simply sum the counts of the runs that belong to each sample.
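As a rough sketch of that last step, summing run-level counts into one count vector per sample could look like this in Python. The run IDs, sample names and counts are hypothetical, and `collapse_runs` is just an illustrative helper; in R, DESeq2 provides collapseReplicates for the same purpose.

```python
from collections import defaultdict

def collapse_runs(run_counts, run_to_sample):
    """Sum per-gene raw counts over all runs that belong to the same sample."""
    sample_counts = defaultdict(lambda: defaultdict(int))
    for run, counts in run_counts.items():
        sample = run_to_sample[run]
        for gene, n in counts.items():
            sample_counts[sample][gene] += n
    # Convert the nested defaultdicts back to plain dicts.
    return {s: dict(c) for s, c in sample_counts.items()}

# Hypothetical mapping of sequencing runs to biological samples.
run_to_sample = {"run1": "sampleA", "run2": "sampleA", "run3": "sampleB"}

# Hypothetical raw counts per run (e.g. from quantifying each run separately).
run_counts = {
    "run1": {"geneA": 10, "geneB": 0},
    "run2": {"geneA": 12, "geneB": 3},
    "run3": {"geneA": 7, "geneB": 5},
}

collapsed = collapse_runs(run_counts, run_to_sample)
print(collapsed)
```

Note that this only makes sense for raw counts; already normalized values such as FPKM should not be summed this way.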
