I'm analyzing some RNA-seq datasets for differential expression. We did not run the experiments ourselves, but the database where I got them from indicates that they were done adding additional spike-ins (ERCC92) for normalization. Originally, I did not use these spike-ins, and just went along with the default median ratio method in all the genes included in DESeq2. I saw many people arguing that using spike-ins is often not very robust, but given that I am analyzing data in which transcription-related factors are knocked-down and therefore a big change in the amount of mRNA is expected, I decided to reanalyze the data using the exogenous spike-ins for normalization. Another reason is that our results showed overexpression of many genes in specific experiments in which the opposite behavior was expected.
I couldn't find that much documentation on how to process RNA-seq data using spike-ins, but this is what I did:
- I joined the fasta files containing the reference genome and that containing the sequence for the spike-ins.
- I did the same with the respective .gtf annotation files.
- I proceeded to align everything as usual.
On the DE analysis, I just included an
estimateSizeFactors() step with the option control specifying the ERCC genes, the code as follows:
se_star <- DESeqDataSetFromHTSeqCount(sampleTable=sampletable, directory=data_path, design=~Condition) se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes) se_star2 <- DESeq(se_star) de <- results(object=se_star2, name="Condition_KO_vs_WT", ) de_shrink <- lfcShrink(dds=se_star2, coef="Condition_KO_vs_WT", type="apeglm")
In two of the three conditions I'm analyzing, I got a reasonable result in terms of DE genes (~2000), which is not that different from what I was achieving without the spike-ins. On the third one, however, the number of DE genes is unusually low (~50), which makes me think that something might be wrong with my analysis. I also calculated the size factors using the iterate normalization:
se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes, type='iterate')
Doing so, the number of DE genes in this experiment becomes more reasonable again, and the other experiments remain in a similar order of magnitude. However, I do not just want stick to this other method without any explanation. I also examined the PCA, correlation and distances heatmap doing as follows:
vsd <- vst(se_star2) sampleDistMatrix <- as.matrix(dist(t(assay(vsd)))) pheatmap(sampleDistMatrix) corMatrix <- cor(assay(vsd)) pheatmap(corMatrix) print(plotPCA(object=vsd, intgroup="Condition")) `
Using the default normalization method, I got the following results:
As you can see, in the distance matrix, the samples are not properly clustered by condition (WT / KO). However, when using the iterative method:
In this case, the samples are properly clustered in all cases, and a much larger percentage of the variance falls in the first principal component of the PCA analysis. This plot is consistent with what I was getting when performing the differential expression analysis in absence of the spike-in.
Am I doing everything alright? Could it be that the spike-in is not correct in one of the experiments? Can I just stick to the 'iterate' method?
Thank you very much,