Good morning,
I'd like to perform a differential expression analysis on some RNA-seq samples from the GEO database (GSE99987) to find genes that differ significantly between conditions. However, the count tables available on GEO contain FPKM-normalized values. According to the authors, this normalization was done with Cuffdiff (v2.2.1).
So my question is: should I use the FPKM-normalized counts for differential expression analysis without applying any other normalization (such as TMM or DESeq size factors)?
P.S.: I am confused because I've always read that FPKM normalization is meant for comparing gene counts within the same sample, whereas normalizations such as TMM or DESeq's are meant for comparing gene counts between different conditions (samples).
Thank you in advance, Goren
Okay, I will start from the raw data. I didn't know that you could download RNA-seq experiments from ENA... Thank you!!
One last question: the raw data for this project has 3-4 runs per sample. How should I deal with that? I have always worked with one .fastq file per sample. Is there a tutorial for this situation?
Thanks!
In the simplest case, you can combine them prior to quantification with
cat in1.fq.gz in2.fq.gz (...) > in_comb.fq.gz
and then proceed as usual. If these are technical replicates, i.e. the same library sequenced on different lanes, you will probably be fine. Alternatively, you can process the runs independently and then run a principal component analysis to check whether the lane replicates cluster together, as a quality check. The DESeq2 manual has a section on PCA and its input requirements (variance-stabilized counts). If this looks OK, you can simply sum the counts of the runs that belong to each sample.
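With many samples, typing the cat command by hand gets tedious, so here is a rough sketch of how the concatenation could be looped over all samples. It rests on several assumptions that are mine and not taken from the dataset: a tab-separated file (called runs.tsv here) with the sample name in column 1 and the run accession in column 2, one file named <run>.fastq.gz per run, and single-end reads. The function name merge_runs is just a placeholder.

```shell
# Merge all sequencing runs of each sample into one FASTQ file per sample.
# Hypothetical inputs (adapt to your own layout):
#   $1  tab-separated map file: sample_name <TAB> run_accession
#   $2  directory holding <run_accession>.fastq.gz
#   $3  output directory for <sample_name>.fastq.gz
merge_runs() {
    local map_file=$1 in_dir=$2 out_dir=$3 sample run
    mkdir -p "$out_dir" || return 1

    # Start from empty output files so that re-running does not append twice.
    cut -f1 "$map_file" | sort -u | while read -r sample; do
        [ -n "$sample" ] || continue
        : > "$out_dir/$sample.fastq.gz"
    done

    # Append each run to the file of its sample. Gzip files can be
    # concatenated as they are; the result is still a valid gzip file.
    while IFS=$'\t' read -r sample run || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue
        cat "$in_dir/$run.fastq.gz" >> "$out_dir/$sample.fastq.gz" || return 1
    done < "$map_file"
}

# Example call (paths are placeholders):
# merge_runs runs.tsv fastq_runs fastq_merged
```

If the data are paired-end, you would do the same separately for the forward and the reverse files, keeping the runs in the same order in both, so that the mates stay in sync.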