I'm trying to generate a plot comparing the sample sizes of published GWAS with the number of associations each found at p < 10^-8.
I've been using the NHGRI Catalog to obtain the relevant studies. Identifying the significant findings is straightforward, but the sample sizes are given as free-text descriptions with little consistency in their structure. For example, some are listed as "#cases, #controls", while others say "up to #individuals", and so on. This means there is no obvious string separator to use to extract just the numbers.
Does anyone know of either a) a database of sample sizes for GWAS which lists the sample sizes numerically rather than as prose; or b) a way I can extract the sample sizes from the catalog (without manually going through several thousand papers...)?
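For option b), one possible starting point is a regular expression that pulls every integer out of the free-text description, whatever words surround it. The sketch below is only illustrative: the function names and the example strings are hypothetical (modelled on the two formats described above, not copied from the catalog), and summing all numbers in a description to get a total sample size is an assumption that would need checking against real entries.

```python
import re

# Matches integers with optional thousands separators, e.g. "1,234" or "56789".
# Caveat: two counts written back to back with no space, such as "500,600",
# would be read as the single number 500600.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def extract_counts(text):
    """Return every integer found in a free-text sample-size description."""
    return [int(match.replace(",", "")) for match in NUMBER_RE.findall(text)]


def total_sample_size(text):
    """Sum all counts in the description (assumes each number is a group size)."""
    return sum(extract_counts(text))


# Hypothetical descriptions in the two styles mentioned in the question.
examples = [
    "1,234 cases, 5,678 controls",
    "Up to 10,000 individuals",
]

for description in examples:
    print(description, "->", extract_counts(description), total_sample_size(description))
```

Descriptions that yield no numbers, or an unexpected count of numbers, could be flagged for manual review, which should still be far fewer than several thousand papers.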