First, those search queries look good - you might also try similar things in the Biostars search bar, since questions like this have been asked many times. Here are some manuscripts and prior posts I recommend looking through:
Other searches to include are "GWAS protocols", "GWAS quality control metrics", "GWAS QC metrics", and the like. I'm going to throw a couple of thoughts your way. I don't know of one single tool that will benchmark a whole workflow - but I haven't looked recently. What I can do is tell you things that are definitely done in practice and should give you several ideas.
heuristic 1:
1) Unit testing - benchmarking step by step. Generally, software developers test the robustness of multistep processes in several ways - one of these is unit testing. Here, instead of benchmarking the entire pipeline, you benchmark each individual step separately as you go. In plain software development this is usually done with respect to speed or reliability, but you can also do it in terms of the results each step produces (a toy example is sketched below).
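As a rough illustration of what "testing one step in isolation" means, here is a minimal sketch in Python. The `filter_by_maf` function and the 0.01 threshold are hypothetical stand-ins for whatever step your own pipeline implements - the point is just that each step gets its own small, self-contained test.

```python
# A minimal sketch of unit-testing one pipeline step in isolation.
# filter_by_maf and the 0.01 threshold are hypothetical stand-ins
# for whatever step your own pipeline actually implements.

def filter_by_maf(variants, min_maf=0.01):
    """Keep only variants whose minor allele frequency >= min_maf."""
    return [v for v in variants if v["maf"] >= min_maf]

def test_filter_by_maf_drops_rare_variants():
    variants = [
        {"id": "rs1", "maf": 0.25},
        {"id": "rs2", "maf": 0.004},  # below threshold, should be removed
        {"id": "rs3", "maf": 0.05},
    ]
    kept = filter_by_maf(variants, min_maf=0.01)
    assert [v["id"] for v in kept] == ["rs1", "rs3"]

if __name__ == "__main__":
    test_filter_by_maf_drops_rare_variants()
    print("MAF filter unit test passed")
```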
Take imputation, for instance. Most imputation algorithms will generate accuracy scores by blinding themselves to some known SNPs, imputing them, and then checking how accurate the imputed calls were. If you go back and look at those scores, you can benchmark several imputation algorithms against one another - for example, using the number of SNVs imputed and the accuracy of each algorithm (as indicated by the masking procedure) to pick the imputation algorithm with the best performance.
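To make the masking idea concrete, here is a toy sketch (not tied to any real imputation tool): hide a fraction of known genotypes, run each candidate method, and score concordance on the hidden entries. The `run_imputation` call in the usage comment is hypothetical - it stands in for whichever algorithm you are benchmarking.

```python
import random

def mask_genotypes(genotypes, fraction=0.05, seed=42):
    """Hide a random fraction of known genotype calls (set them to None)."""
    rng = random.Random(seed)
    masked = dict(genotypes)
    n_hidden = max(1, int(len(genotypes) * fraction))
    hidden = rng.sample(sorted(genotypes), k=n_hidden)
    for snp in hidden:
        masked[snp] = None
    return masked, hidden

def concordance(truth, imputed, hidden_snps):
    """Fraction of masked genotypes that the imputation method recovered correctly."""
    correct = sum(1 for snp in hidden_snps if imputed.get(snp) == truth[snp])
    return correct / len(hidden_snps)

# Usage sketch: 'truth' would come from your genotyped data, and
# run_imputation() is a hypothetical stand-in for the method under test.
# truth = {"rs1": "AA", "rs2": "AG", ...}
# masked, hidden = mask_genotypes(truth, fraction=0.05)
# imputed = run_imputation(masked)
# print("concordance:", concordance(truth, imputed, hidden))
```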
heuristic 2:
2) Recovery of known "true positive" results: let's take seropositive rheumatoid arthritis as a disease phenotype. A lot of GWAS of RA have been done; indeed, these have been organized into meta-analyses and redone with vast sample sizes (GWAMA). Nearly all of these studies nominate the HLA region (specifically HLA-DRB1). Of those that don't, most or all did not separate seronegative from seropositive RA patients as cleanly.
So, if a researcher is doing yet another GWAS of seropositive rheumatoid arthritis, he or she should expect to see a strong association at HLA-DRB1. If they fail to find that association, the most likely explanation is that they made an error somewhere in their pipeline. The same goes for the other top 10 strongest loci - if they fail to identify any of them, then something is definitely wrong...
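A toy sanity check along those lines might look like the sketch below. The column names (CHR, BP, P) and the GRCh37 coordinates for the HLA-DRB1 window are assumptions - adjust them to your own summary-statistics format and genome build.

```python
import csv

# Rough GRCh37 window around HLA-DRB1; treat these coordinates as an
# assumption and adjust for your genome build / preferred padding.
HLA_DRB1_CHR = "6"
HLA_DRB1_START = 32_400_000
HLA_DRB1_END = 32_700_000
GENOME_WIDE_SIG = 5e-8

def has_hla_drb1_signal(summary_stats_path):
    """Return True if any variant in the HLA-DRB1 window reaches genome-wide significance."""
    with open(summary_stats_path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumed column names: CHR, BP, P -- rename to match your file.
            if (row["CHR"] == HLA_DRB1_CHR
                    and HLA_DRB1_START <= int(row["BP"]) <= HLA_DRB1_END
                    and float(row["P"]) < GENOME_WIDE_SIG):
                return True
    return False

# if not has_hla_drb1_signal("seropositive_ra_gwas.tsv"):   # hypothetical file name
#     print("WARNING: no HLA-DRB1 hit -- check the pipeline for errors")
```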
heuristic 3:
3) Other QC metrics: the other thing to do, I think, is take a look at well-known QC metrics - lambda GC to check for genomic inflation, HWE to flag bad SNVs, and LD support to see whether a variant's association tracks with strongly linked variants in the same region. It's also definitely standard practice to do sample QC and SNP QC separately.
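For lambda GC specifically, a quick back-of-the-envelope check is to convert your association p-values to 1-df chi-square statistics and compare the observed median to the expected median under the null (about 0.4549); values close to 1.0 are reassuring, while values much above ~1.05 suggest inflation. A minimal sketch, assuming 1-df test statistics and scipy:

```python
from statistics import median
from scipy import stats

def genomic_inflation_factor(p_values):
    """Estimate lambda GC: median observed chi-square / median expected under the null (1 df)."""
    chi2_stats = [stats.chi2.isf(p, df=1) for p in p_values]  # p-value -> chi-square statistic
    expected_median = stats.chi2.ppf(0.5, df=1)               # ~0.4549 under the null
    return median(chi2_stats) / expected_median

# Usage sketch: p_values would be the association p-values from your GWAS.
# lam = genomic_inflation_factor(p_values)
# Values much above ~1.05 suggest inflation from population stratification
# or problems elsewhere in the pipeline.
```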
Conclusions: one could conceivably create an end-to-end benchmark, but it would probably amount to chaining together lots of these heuristics / QC metrics in the first place. So, unfortunately, I don't know a better way forward than taking a good, rigorous look at each step separately.