Question

Problems using ERCC spike-ins for normalization in DESeq2

4

Entering edit mode

23 months ago

manuel.fernandez ▴ 50

Hello everyone,

I'm analyzing some RNA-seq datasets for differential expression. We did not run the experiments ourselves, but the database where I got them from indicates that they were done adding additional spike-ins (ERCC92) for normalization. Originally, I did not use these spike-ins, and just went along with the default median ratio method in all the genes included in DESeq2. I saw many people arguing that using spike-ins is often not very robust, but given that I am analyzing data in which transcription-related factors are knocked-down and therefore a big change in the amount of mRNA is expected, I decided to reanalyze the data using the exogenous spike-ins for normalization. Another reason is that our results showed overexpression of many genes in specific experiments in which the opposite behavior was expected.

I couldn't find that much documentation on how to process RNA-seq data using spike-ins, but this is what I did:

I joined the fasta files containing the reference genome and that containing the sequence for the spike-ins.
I did the same with the respective .gtf annotation files.
I proceeded to align everything as usual.

On the DE analysis, I just included an estimateSizeFactors() step with the option control specifying the ERCC genes, the code as follows:

se_star <- DESeqDataSetFromHTSeqCount(sampleTable=sampletable, directory=data_path, design=~Condition)
se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes)
se_star2 <- DESeq(se_star)

de <- results(object=se_star2, name="Condition_KO_vs_WT", )
de_shrink <- lfcShrink(dds=se_star2, coef="Condition_KO_vs_WT", type="apeglm")

In two of the three conditions I'm analyzing, I got a reasonable result in terms of DE genes (~2000), which is not that different from what I was achieving without the spike-ins. On the third one, however, the number of DE genes is unusually low (~50), which makes me think that something might be wrong with my analysis. I also calculated the size factors using the iterate normalization:

se_star <- estimateSizeFactors(se_star, controlGenes=ercc_genes, type='iterate')

Doing so, the number of DE genes in this experiment becomes more reasonable again, and the other experiments remain in a similar order of magnitude. However, I do not just want stick to this other method without any explanation. I also examined the PCA, correlation and distances heatmap doing as follows:

vsd <- vst(se_star2)
sampleDistMatrix <- as.matrix(dist(t(assay(vsd))))
pheatmap(sampleDistMatrix)

corMatrix <- cor(assay(vsd))
pheatmap(corMatrix)

print(plotPCA(object=vsd, intgroup="Condition"))
`

Using the default normalization method, I got the following results:

Default normalization - spikein

As you can see, in the distance matrix, the samples are not properly clustered by condition (WT / KO). However, when using the iterative method:

Iterative normalization - spikein

In this case, the samples are properly clustered in all cases, and a much larger percentage of the variance falls in the first principal component of the PCA analysis. This plot is consistent with what I was getting when performing the differential expression analysis in absence of the spike-in.

Am I doing everything alright? Could it be that the spike-in is not correct in one of the experiments? Can I just stick to the 'iterate' method?

Thank you very much,
Manuel F.

Spike-ins Differential-expression DESeq2 RNA-seq • 2.9k views

ADD COMMENT • link 22 months ago by manuel.fernandez ▴ 50

0

Entering edit mode

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] DESeq2_1.30.1               SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1
 [5] matrixStats_0.63.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.24.1
 [9] S4Vectors_0.28.1            BiocGenerics_0.36.1         Polychrome_1.5.1            MetBrewer_0.2.0
[13] patchwork_1.1.2             ggrepel_0.9.3               ggalluvial_0.12.5           pheatmap_1.0.12
[17] RColorBrewer_1.1-3          data.table_1.14.6           stringi_1.7.12              lubridate_1.9.2
[21] forcats_1.0.0               stringr_1.5.0               readr_2.1.4                 tidyr_1.3.0
[25] tibble_3.1.8                ggplot2_3.4.1               tidyverse_2.0.0             purrr_1.0.1
[29] dplyr_1.1.0                 readxl_1.4.2

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           bit64_4.0.5            httr_1.4.5             numDeriv_2016.8-1.1    tools_4.0.3
 [6] utf8_1.2.3             R6_2.5.1               DBI_1.1.3              colorspace_2.1-0       apeglm_1.12.0
[11] withr_2.5.0            tidyselect_1.2.0       bit_4.0.5              compiler_4.0.3         cli_3.6.0
[16] DelayedArray_0.16.3    labeling_0.4.2         scales_1.2.1           mvtnorm_1.1-3          genefilter_1.72.1
[21] XVector_0.30.0         pkgconfig_2.0.3        fastmap_1.1.1          bbmle_1.0.25           rlang_1.0.6
[26] rstudioapi_0.14        RSQLite_2.3.0          generics_0.1.3         farver_2.1.1           BiocParallel_1.24.1
[31] RCurl_1.98-1.10        magrittr_2.0.3         GenomeInfoDbData_1.2.4 Matrix_1.5-3           Rcpp_1.0.10
[36] munsell_0.5.0          fansi_1.0.4            lifecycle_1.0.3        scatterplot3d_0.3-43   MASS_7.3-53
[41] zlibbioc_1.36.0        plyr_1.8.8             grid_4.0.3             blob_1.2.3             bdsmatrix_1.3-6
[46] lattice_0.20-41        splines_4.0.3          annotate_1.68.0        hms_1.1.2              locfit_1.5-9.1
[51] pillar_1.8.1           geneplotter_1.68.0     XML_3.99-0.13          glue_1.6.2             vctrs_0.5.2
[56] tzdb_0.3.0             cellranger_1.1.0       gtable_0.3.1           cachem_1.0.7           emdbook_1.3.12
[61] xfun_0.37              xtable_1.8-4           coda_0.19-4            survival_3.2-7         tinytex_0.44
[66] AnnotationDbi_1.52.0   memoise_2.0.1          timechange_0.2.0       ellipsis_0.3.2

ADD REPLY • link updated 23 months ago by Ram 45k • written 23 months ago by manuel.fernandez ▴ 50

0

Entering edit mode

since you have the spike in controls, can you verify that you do get the expected foldchanges between the different ERCC transcripts both within and across samples, that would be a way that is independent of the size estimation method.

ADD REPLY • link 23 months ago by Istvan Albert 102k

score 4 · Answer 1 · 2023-08-25

4

Entering edit mode

23 months ago

jared.andrews07 ★ 19k

Take a look at RUVSeq, it's the most simple way I've found to incorporate the spike-ins, and it plays nicely with DESeq2.

ADD COMMENT • link 23 months ago by jared.andrews07 ★ 19k

0

Entering edit mode

Hi,

Thanks a lot, I will have a look and try to reprocess my data using it, maybe I can correct for some bias if that is indeed the source of my errors.

ADD REPLY • link 22 months ago by manuel.fernandez ▴ 50

0

Entering edit mode

Hi jared.andrews07

I read the documentation for the package and implemented it, but I'm still not sure I trust the results. Does RUVseq account for big changes in total amount of RNA between samples? Or does it only correct differences between the amount of ERCC in each one of the replicates?

Thanks!

ADD REPLY • link 22 months ago by manuel.fernandez ▴ 50

0

Entering edit mode

but I'm still not sure I trust the results

That is smart.

In general, we don't use the spike-ins unless there is a clear reason to do so. MA plots are usually a pretty clear way to tell if there is a global shift in one group versus another. I think we had one specific instance where it felt appropriate to normalize via spike-in rather than "traditional" standard methods. We always include them (just in case), but rarely use them in the analysis.

ADD REPLY • link 22 months ago by jared.andrews07 ★ 19k