Question: Extracting Sample Sizes From The NHGRI GWAS Catalog
5.6 years ago by
European Union
coleman_jonathan410 wrote:

Hi all,

I'm trying to generate a plot comparing the sample sizes of published GWAS with the number of associations each found with p < 10^-8.

I've been using the NHGRI Catalog to obtain the relevant studies. Identifying the significant findings is straightforward, but the sample sizes are contained in prose lines with little consistency in their structure. For example, some are listed as "#cases, #controls", while others say "up to # individuals", and so on. This means there is no obvious string separator to use to extract just the numbers.

Does anyone know of either a) a database of sample sizes for GWAS which lists the sample sizes numerically rather than as prose; or b) a way I can extract the sample sizes from the catalog (without manually going through several thousand papers...)?


Hi! I'm very interested in the plot you planned to draw. Did you manage to produce it? If so, would you mind sharing it with me? I will definitely give you credit. Thanks!

5.6 years ago by
Richard Smith400
Cambridge, UK
Richard Smith400 wrote:

I think the HuGE GWAS Navigator includes all the data from the NHGRI GWAS Catalog, plus data from other sources. Its file has a column for sample size, covering the initial and replication samples where applicable, and this is populated for many of the entries. The counts and populations are still given in prose, but I think they are a bit more consistent, so they should be easier to parse. The disease/trait names from HuGE are certainly more consistent.
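
For the parsing step itself, here is a minimal Python sketch. It is not taken from either catalog's documentation. It assumes each sample-size entry is a free-text string containing integers, optionally with comma thousands separators; the example strings are made up for illustration, and the function names extract_counts and total_sample_size are hypothetical.

```python
import re

# Matches integers with comma thousands separators ("12,000") or plain digits ("5678").
# Assumption: every number in the string is a sample count.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def extract_counts(text):
    """Return every integer found in a prose sample-size description."""
    return [int(match.replace(",", "")) for match in NUMBER.findall(text)]


def total_sample_size(text):
    """Sum all counts in the description (e.g. cases plus controls)."""
    return sum(extract_counts(text))


if __name__ == "__main__":
    # Illustrative strings only, not real catalog entries.
    for entry in [
        "1,234 European ancestry cases, 5,678 European ancestry controls",
        "Up to 12,000 individuals",
    ]:
        print(entry, "->", extract_counts(entry), total_sample_size(entry))
```

Two caveats: any other number in the text (for example the "2" in "type 2 diabetes cases") will be counted as well, and a compact entry like "120,340" meaning 120 cases and 340 controls would be read as a single number. It is worth spot-checking a sample of the parsed values against the original strings before plotting.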

