I would have expected 200-250 genes in each two-group analysis.
Why? What do you know about your data that you aren't describing? Many experiments give large numbers of DE genes regardless of technique if that is simply the shape of the data; perhaps you're comparing things that are very different from each other. Remember that DESeq and edgeR make assumptions about the data, such as that most genes are not differentially expressed. If you happen to compare two conditions that are vastly different, you may get many genes with low p-values.

You might ask yourself: in your 10k gene set, what is the smallest expression ratio between your conditions? If you get something with a log2 fold change of 0.2 (i.e. a change of about 1.15-fold) but a low p-value, would you believe it? Can you seriously detect a 15% difference in gene expression between two conditions? In these cases you can simply combine criteria: require a 2-fold change (or whatever change you would believe) and a low p-value, and tune the thresholds until you get a number of genes you can reasonably pursue.

Unless you are doing something wrong in the analysis, your experiment may simply be giving you a large number of genes, and you can rank them by p-value to get a top set, regardless of what that value is. There's nothing magic about 0.05, or 0.01, or 0.0001. Assuming a sound experiment and reasonably clean data, your result is your result (though your initial statement of an expected number hints there's something else going on).
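The combined-criteria idea above can be sketched in a few lines. This is a minimal illustration with made-up per-gene results (gene name, log2 fold change, adjusted p-value), not output from any real tool:

```python
# Hypothetical per-gene results: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 0.2, 1e-8),   # 2**0.2 ~ 1.15-fold: tiny effect despite a low p-value
    ("geneB", 2.3, 1e-4),
    ("geneC", -1.7, 0.003),
    ("geneD", 0.9, 0.04),
]

# Require both a fold change you would believe (here 2-fold, i.e.
# |log2FC| >= 1) AND a low adjusted p-value.
hits = [(g, lfc, p) for g, lfc, p in results if abs(lfc) >= 1 and p < 0.01]

# Rank by p-value to get a top set, regardless of the exact cutoff.
hits.sort(key=lambda r: r[2])
print(hits)  # [('geneB', 2.3, 0.0001), ('geneC', -1.7, 0.003)]
```

Note that geneA, with its 1.15-fold change, is excluded no matter how small its p-value is; tightening either threshold shrinks the hit list further.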
On the other hand, if your experiment violates some of the usual assumptions behind fitting a negative binomial to count data (as DESeq does), and you have reason to tweak the model, there are parameters you can play with (such as the dispersion).
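For intuition on what the dispersion does, here is a sketch of the negative binomial mean-variance relation in the parameterisation DESeq-style models commonly use (variance = mu + alpha * mu^2, where alpha is the dispersion); the numbers are purely illustrative:

```python
def nb_variance(mu: float, alpha: float) -> float:
    """Variance of a negative binomial count with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2

# With alpha = 0 this collapses to the Poisson case (variance == mean).
print(nb_variance(100, 0.0))  # 100.0
# A large dispersion (here 0.5, chosen for illustration) inflates the
# variance far beyond the mean, which weakens apparent significance.
print(nb_variance(100, 0.5))  # 5100.0
```

So misestimating the dispersion cuts both ways: too low and everything looks significant, too high and nothing does.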
tl;dr: it's not a matter of increasing the stringency of the evaluation; if you want fewer results, simply use a lower p-value cutoff.