“Within a single experiment” is O.K. with caveats. Comparing the number up versus the number down inside a single, well-designed experiment is reasonable as a high-level readout. However, note that these counts can (and often do) reflect more than biology: statistical power, dispersion, significance thresholds, filtering, effect-size estimation, and so on.
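To make that concrete, here is a minimal sketch in Python (the analysis itself would more commonly be done in R) using simulated numbers and made-up column names (`padj`, `log2FC`), not the output of any particular tool. It just shows how the up/down counts are read off a results table, and that the filtering and effect-size cutoffs are baked into those numbers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for a differential-expression results table;
# real tables (e.g., from edgeR, DESeq2, or limma) have analogous columns.
n_genes = 10_000
res = pd.DataFrame({
    "log2FC": rng.normal(0.0, 1.0, n_genes),
    "padj": rng.uniform(0.0, 1.0, n_genes),
})

# Within one experiment, "number up" vs "number down" is read off the table,
# but both counts depend on the FDR and effect-size cutoffs chosen here.
fdr_cutoff, lfc_cutoff = 0.05, 1.0
sig = res[(res["padj"] < fdr_cutoff) & (res["log2FC"].abs() > lfc_cutoff)]
print("up:  ", int((sig["log2FC"] > 0).sum()))
print("down:", int((sig["log2FC"] < 0).sum()))
```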
“Between different experiments...” Do you mean between different studies, or across experiments within the same study? In either case, counts of significant genes are not a reliable proxy for “how much biology happened,” because of differences in library prep, experimental design, thresholds, dispersion estimates, model contrasts, and so on. (The appropriate jargon here is “confounding differences.”)
That said, I worked on a study where the lead author (who did all the benchwork) deliberately “harmonized” the different experiments, meaning they were processed identically (or as close as possible under the conditions) from start to finish. In that case we were able to make these (and other) comparisons with fewer caveats than usual, because the NGS assays were performed under the same conditions, libraries were made with the same kits, reads were aligned to the same reference with the same parameters, features were counted in the same way, count matrices were filtered in the same way, the same normalizations and design matrices were applied, etc.
What do you mean by "directly compare"?
The actual numbers of genes in the two groups don't have any special meaning; they will change based on the limits you set.
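As a quick illustration of that point (again with a simulated table and hypothetical column names, not any specific tool's output), the same results table yields very different “numbers of DE genes” depending only on the limits you pick:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated results table; substitute your tool's actual output columns.
res = pd.DataFrame({
    "log2FC": rng.normal(0.0, 1.5, 10_000),
    "padj": rng.uniform(0.0, 1.0, 10_000),
})

# The count of "DE genes" shifts with the cutoffs alone, before any
# biology is considered, which is why the raw number has no special
# meaning on its own.
for fdr in (0.10, 0.05, 0.01):
    for lfc in (0.0, 0.5, 1.0):
        n = int(((res["padj"] < fdr) & (res["log2FC"].abs() > lfc)).sum())
        print(f"FDR < {fdr}, |log2FC| > {lfc}: {n} genes")
```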