I don't really follow the logic there - evaluating the performance of a tool should be independent of understanding how or why it works. This is like saying that we cannot evaluate a car unless the manufacturer discloses all its trade secrets and shows us the blueprints.
(Moreover, the overwhelming majority of people who use an open-source tool do not understand what it does internally - yet they should still be able to evaluate it.)
For me, the implicit message is that we need more independent datasets and more benchmark studies, so that heavily overfitted black-box software tools can be identified. In my experience it is common practice, once a dataset has been processed by a company offering sequencing services with its proprietary protocol, to compare those results against the output of open-source software. However, the results of such benchmarks are mostly left behind once the paper is published.
PS: In my primary field of research the situation is almost identical to the one described in the paper: there is essentially no way to test commercial software with synthetic data or raw sequencing data, short of splitting a sample into two parts and running a parallel library prep & analysis.