While I think this is hard to prove precisely, I don't think the ERCC spike-in information is really as crucial as the question assumes.
However, I really like the comments from first-hand experience, such as "From my experience supervised svaseq behaves similarly with RUVg," and I thought it was really important that eldronzhou pointed out "BTW, in RUV paper the authors suggest that ERCC spike-in does not behave like endogenous genes. Global [normalization] based on ERCC spike-in can lead to poor [normalized] counts."
So, for what it is worth, here is my input:
1) My personal preference is to handle the unwanted variation as covariates in multivariate models for differential expression, tested with multiple methods (such as edgeR, DESeq2, limma-voom, etc.), rather than applying a corrected normalization upstream of differential expression (although I do test visualizations with simple adjustments after differential expression, such as centering expression within the groups that you want to adjust for). There is a rough sketch of what I mean after this list.
2) You still need to critically assess supervised normalization strategies. For example, if you use ComBat to adjust expression in a way that essentially forces your samples into the clustering that you want, you should be wary of over-correction (which may make results less robust and harder to reproduce in other studies). I would recommend checking expression before and after any sort of adjustment; there is a second sketch of this kind of check after this list. For example, maybe it is helpful to center expression by batch, but can you check both versions of the expression values and confirm that your conclusions would be similar? (Do you see similar trends in each batch? Or did your normalization do something like flip the direction of the gene expression change within a batch, which would need to be examined more carefully?)
3) While I'm sure you can find a variety of opinions, here are some other references that I believe indicate normalization with ERCC spike-ins can be problematic:
Paper #1 (Qing et al. 2013): "[ERCC] fluctuation may prevent the ERCC controls from being used for cross-sample normalization in RNA-Seq"
Paper #2 (SEQC 2014 Nature paper): "We observed, however, that the fraction of reads aligning to ERCC spike-ins for a given sample varied widely between libraries and platforms, with measured ERCC ranges of 1–2.5% for HiSeq 2000 and 2.5–4.7% for SOLiD, with a clear ‘library effect’ observed for all sites and platforms, affecting reproducibility"
-->My opinion on this is that you should probably try to define the most direct adjustment possible. In other words, if what you really want to adjust for is the total number of detected genes, then maybe something like TMM normalization (calculated from the overall distribution of all quantified genes) is better than trying to normalize based upon manually added ERCC spike-ins (although, even if it helps overall, I think it should be understood that the TMM normalization may not be perfect, and may still have some amount of over-correction that you should evaluate for each project). A third sketch below compares the two scaling strategies.
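To make point #1 concrete, here is a minimal sketch of putting the batch into the design matrix with edgeR rather than "fixing" the counts beforehand. The object names (counts, group, batch) are hypothetical placeholders for a count matrix and two sample-level factors, and the same design matrix could be passed to DESeq2 or limma-voom instead; this is just one reasonable setup, not the only one.

```r
## Hypothetical inputs: counts (genes x samples), group and batch (factors per sample)
library(edgeR)

design <- model.matrix(~ batch + group)        # batch enters the model as a covariate

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge, method = "TMM")    # composition/library-size normalization
dge <- estimateDisp(dge, design)

fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = ncol(design))    # test the group effect, adjusted for batch
topTags(qlf)
```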
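For the before/after check in point #2, here is a rough sketch (again with hypothetical names: a log-scale expression matrix log_expr plus batch and group factors) that adjusts with ComBat while protecting the group effect, and then asks whether the within-batch group differences still point in the same direction afterwards.

```r
## Hypothetical inputs: log_expr (genes x samples, log scale), batch, group (factors)
library(sva)

mod      <- model.matrix(~ group)                       # keep the biology in the model
adjusted <- ComBat(dat = log_expr, batch = batch, mod = mod)

## Mean group difference within each batch (group level 2 minus level 1)
within_batch_diff <- function(mat) {
  sapply(levels(batch), function(b) {
    idx <- batch == b
    rowMeans(mat[, idx & group == levels(group)[2], drop = FALSE]) -
      rowMeans(mat[, idx & group == levels(group)[1], drop = FALSE])
  })
}

before <- within_batch_diff(log_expr)
after  <- within_batch_diff(adjusted)

## Genes whose within-batch direction flipped after adjustment deserve a closer look
flipped <- rowSums(sign(before) != sign(after)) > 0
table(flipped)
```

If many genes flip direction within a batch, that is the kind of over-correction I would want to look at manually before trusting the adjusted values.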
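And for the TMM point under #3, a quick sketch of how I might compare an overall-distribution scaling factor against the spike-in signal, assuming a hypothetical counts matrix that contains endogenous genes plus rows named like "ERCC-*". The spike-in "factor" here is just the per-sample ERCC read fraction, used for comparison rather than as a recommended normalization.

```r
## Hypothetical input: counts (endogenous genes plus ERCC rows, samples in columns)
library(edgeR)

is_ercc <- grepl("^ERCC-", rownames(counts))

dge <- DGEList(counts = counts[!is_ercc, ])
dge <- calcNormFactors(dge, method = "TMM")     # scaling from all quantified genes
tmm_factors <- dge$samples$norm.factors

ercc_fraction <- colSums(counts[is_ercc, ]) / colSums(counts)

plot(tmm_factors, ercc_fraction,
     xlab = "TMM normalization factor", ylab = "ERCC read fraction")
```

If the ERCC fraction varies widely between libraries (the "library effect" described in the SEQC paper) while the TMM factors stay relatively stable, that is a hint that spike-in based scaling would behave quite differently from scaling based on all quantified genes.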
(For #2, I should thank members of the Bioinformatics Forum at City of Hope for a discussion that prompted me to, at least briefly, read these citations in the context of the ERCC spike-ins recently, although these are my individual opinions and agreement should not be assumed for all members of the discussion group. I also think the Qing et al. 2013 paper has a relatively small number of citations for a good paper that made some important/interesting points which I found fairly easy to understand.)