Question: What is the best GWAS software for extremely large datasets? (e.g. Plink, Hail, BGENIE...)
Kim6845 wrote (3 months ago):

Hi, Biostars

I am working with a large genotype dataset for GWAS, specifically the UK Biobank dataset. It contains 93,095,623 autosomal SNPs on about 0.5 million individuals, stored in bgen v1.2 format as a separate file for each chromosome (100 GB per chromosome). (UK Biobank: https://www.nature.com/articles/s41586-018-0579-z )

Although the Neale group has made GWAS results on this genotype dataset public, I have to run the GWAS afresh, because my research is on a new phenotype that was not included in their study. (Neale group GWAS results: http://www.nealelab.is/uk-biobank )

First, I would like to run basic quality control on the dataset (e.g. missing rate, MAF, HWE). Afterwards I would run the GWAS on it with only one phenotype.
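To make the three QC metrics concrete, here is a minimal per-variant sketch in plain Python. It assumes hard-call genotypes coded 0/1/2 (count of allele B) with `None` for missing calls; the function name `qc_metrics` and this coding are illustrative assumptions, not part of any of the tools discussed. The HWE check here is a simple chi-square statistic; real tools typically use an exact test and work on the bgen/pgen files directly.

```python
from typing import Optional, Sequence


def qc_metrics(genotypes: Sequence[Optional[int]]) -> dict:
    """Per-variant QC for hard calls coded 0/1/2 (copies of allele B).

    None marks a missing call. Hypothetical helper for illustration only.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    missing_rate = 1 - n / len(genotypes)
    if n == 0:
        return {"missing_rate": missing_rate, "maf": None, "hwe_chi2": None}

    n_ab = called.count(1)
    n_bb = called.count(2)
    n_aa = n - n_ab - n_bb

    # Allele frequency of B, then fold to the minor allele frequency.
    q = (n_ab + 2 * n_bb) / (2 * n)
    p = 1 - q
    maf = min(p, q)

    # Chi-square statistic against Hardy-Weinberg expected counts.
    # Undefined for monomorphic variants (an expected count is zero).
    if maf == 0:
        hwe_chi2 = None
    else:
        observed = [n_aa, n_ab, n_bb]
        expected = [n * p * p, 2 * n * p * q, n * q * q]
        hwe_chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    return {"missing_rate": missing_rate, "maf": maf, "hwe_chi2": hwe_chi2}


if __name__ == "__main__":
    # Ten samples, one missing call.
    print(qc_metrics([0, 0, 0, 0, 1, 1, 1, 1, 2, None]))
```

Each metric needs a full pass over the variant's genotypes, which is why computing them together in one pass matters at this scale.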

I found some tools that look appropriate for this procedure. I think it is important to choose carefully not only the GWAS tool but also the QC tool, because QC on a large dataset requires parsing it several times, which takes a lot of time. (Explanation of why the bgen format is needed: https://www.well.ox.ac.uk/~gav/bgen_format/ )

Could you advise me on selecting the proper software?

Thanks!


For quality control,

  1. qctool (https://www.well.ox.ac.uk/~gav/qctool_v2/ ) pros: qc procedure optimized for bgen format (maybe...)
  2. Plink 2.0 pros: compatible with bgen, in contrast to plink 1.9

For GWAS,

  1. Plink 2.0

  2. Hail (Scala-based scalable GWAS tool, optimized for cluster computing on environments like Google Cloud, AWS, etc.)
    example code: https://github.com/Nealelab/UK_Biobank_GWAS

  3. BGENIE GWAS tool optimized for bgen format https://jmarchini.org/bgenie/


chrchang523 wrote (3 months ago):

This depends primarily on two things.

  1. Do you want to consider genotype posterior probabilities in your QC and analysis, vs. just dosages? The bgen format stores genotype probability triples of the form {P(genotype = AA), P(genotype = AB), P(genotype = BB)}, where A and B are the two alleles. However, most QC and analysis steps collapse this triple down to a single dosage value, equal to the expected count of one of the alleles (so P(genotype = AB) + 2 * P(genotype = BB) for allele B). For both this reason and the efficiency gains that result from only worrying about dosages, plink 2.0's "pgen" file format only supports dosages. Thus, if you use plink 2.0 as part of your analysis pipeline, any steps which actually care about the raw genotype posterior probabilities must happen before conversion-to-pgen.

(Note that, when dosages are sufficient, plink 2.0 is consistently 10-100+ times faster than the bgen-based tools.)

  2. How much do you want to customize the main analysis? Plink 2.0 and qctool/BGENIE support the most common QC operations and types of regression; it sounds like both are sufficient for what you want to do today. However, if you want to perform data exploration beyond "standard GWAS", Hail is the best platform I'm aware of for Biobank-sized datasets.
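
The probability-to-dosage collapse described in the first point can be sketched in a few lines of plain Python. The function name `dosage_b` and the tolerance are illustrative assumptions; this is not how plink 2.0 or any bgen tool implements it internally, only the arithmetic from the formula above.

```python
def dosage_b(p_aa: float, p_ab: float, p_bb: float, tol: float = 1e-6) -> float:
    """Expected count of allele B from a genotype probability triple.

    Hypothetical helper: dosage = P(AB) + 2 * P(BB).
    """
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return p_ab + 2 * p_bb


if __name__ == "__main__":
    # A fairly confident BB call with some AB uncertainty.
    print(dosage_b(0.1, 0.3, 0.6))

    # Two very different triples collapse to the same dosage.
    print(dosage_b(0.5, 0.0, 0.5), dosage_b(0.0, 1.0, 0.0))
```

The second pair shows why the conversion is lossy: a 50/50 split between AA and BB and a certain AB call both give a dosage of 1, so anything that needs to tell them apart has to run on the original probabilities.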