Question

GCTA GREML for WGS with extremely large number of samples and SNPs

5.2 years ago

moldach ▴ 130

I've been trying to use GCTA (a tool for Genome-wide Complex Trait Analysis) for a larger job than usual, and I keep getting a "killed" message from our cluster when I run the segment-based LD score step on my data, which makes me think it's a memory issue. We don't currently have a job scheduler on that cluster, and I believe the job was killed because it was the biggest process running when the machine ran out of memory.

Now I'm running the following script on the binary files (78G for the .bed, 1.1G for the .bim and 184K for the .fam) on another cluster with a SLURM scheduler. I've requested 250G of RAM and 48 threads, and the job has already been running for two days, so I'm wondering whether there is a way to do this more efficiently. The maximum wall time I'm currently allowed is 5 days, so I would need to request more if it doesn't finish.

#!/bin/bash
#SBATCH --time=5-00:00:00
#SBATCH --mem=250G
#SBATCH --cpus-per-task=48

gcta64 --bfile ./alspac_moms --ld-score-region 200 --thread-num 48 --out alspac_moms

The basic GREML tutorial suggests you can split up the data by chromosome, like so:

gcta64 --bfile test --chr 1 --maf 0.01 --make-grm --out test_chr1 --thread-num 10
gcta64 --bfile test --chr 2 --maf 0.01 --make-grm --out test_chr2 --thread-num 10
...
gcta64 --bfile test --chr 22 --maf 0.01 --make-grm --out test_chr22 --thread-num 10

Is it possible to do something similar for GREML in WGS or imputed data? For example:

gcta64 --bfile test --chr 1 --ld-score-region 200 --out test_chr1
gcta64 --bfile test --chr 2 --ld-score-region 200 --out test_chr2
...
lds_seg_chr1 = read.table("test_chr1.score.ld",header=T,colClasses=c("character",rep("numeric",8)))
lds_seg_chr2 = read.table("test_chr2.score.ld",header=T,colClasses=c("character",rep("numeric",8)))
...
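One way to write the per-chromosome runs above without repeating the command 22 times is a SLURM job array. This is only a sketch of the idea in the question, not a confirmed fix: the time, memory and CPU requests are placeholder assumptions, `DRY_RUN` is a hypothetical switch added here so the command can be inspected first, and whether per-chromosome LD scores are statistically equivalent to the genome-wide run is exactly what is being asked.

```shell
#!/bin/bash
#SBATCH --time=1-00:00:00      # placeholder, tune to the largest chromosome
#SBATCH --mem=32G              # placeholder
#SBATCH --cpus-per-task=8      # placeholder
#SBATCH --array=1-22           # one array task per autosome

# Build the gcta64 command for one chromosome (flags taken from the post).
ld_score_cmd() {
    local chr=$1
    echo "gcta64 --bfile test --chr ${chr} --ld-score-region 200 --thread-num ${SLURM_CPUS_PER_TASK:-8} --out test_chr${chr}"
}

# SLURM sets SLURM_ARRAY_TASK_ID inside an array job; fall back to 1 elsewhere.
cmd=$(ld_score_cmd "${SLURM_ARRAY_TASK_ID:-1}")
echo "$cmd"

# DRY_RUN is a hypothetical switch: the command only runs when DRY_RUN=0.
if [ "${DRY_RUN:-1}" = "0" ]; then
    $cmd
fi
```

Saved as, say, ld_score_array.sh (a made-up name), it could be submitted with `sbatch --export=ALL,DRY_RUN=0 ld_score_array.sh`. Each task then writes its own test_chrN.score.ld and only needs memory for one chromosome at a time.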

So that I would have stratified SNPs by segment-based LD score for each chromosome and then make GRMs for each of these groups:

chr1_snp_group1.txt
chr1_snp_group2.txt
chr1_snp_group3.txt
chr1_snp_group4.txt
...
chr22_snp_group1.txt
chr22_snp_group2.txt
chr22_snp_group3.txt
chr22_snp_group4.txt
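For reference, the per-chromosome SNP lists named above could be produced without loading each .score.ld file into R. The sketch below splits one file into four equal-sized groups by rank of the `ldscore_SNP` column. It assumes the file has a header row containing a column called `ldscore_SNP` and the SNP ID in the first column. Note that it cuts by rank, so ties at a boundary can land in either group, which is not identical to cutting at R's `quantile()` thresholds.

```shell
#!/bin/bash
# Split one .score.ld file into four SNP lists by rank of ldscore_SNP.
# Usage: split_by_ld_quartile test_chr1.score.ld chr1
split_by_ld_quartile() {
    local ld_file=$1 prefix=$2 col n

    # Locate the ldscore_SNP column from the header (column name is an assumption).
    col=$(awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ldscore_SNP") print i; exit }' "$ld_file")
    [ -n "$col" ] || { echo "ldscore_SNP column not found in $ld_file" >&2; return 1; }

    # Number of SNP rows, excluding the header.
    n=$(tail -n +2 "$ld_file" | grep -c .)

    # Sort ascending by LD score, then assign ranks to groups 1..4.
    tail -n +2 "$ld_file" | sort -k"${col},${col}g" | awk -v n="$n" -v prefix="$prefix" '
        {
            grp = int((NR - 1) * 4 / n) + 1
            print $1 > (prefix "_snp_group" grp ".txt")   # column 1 = SNP ID
        }'
}
```

Calling `split_by_ld_quartile test_chr1.score.ld chr1` would write chr1_snp_group1.txt through chr1_snp_group4.txt in the current directory.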

And then perform the REML analysis on those 88 GRMs (22 chromosomes × 4 LD-score groups)? I'm wondering whether that's a valid approach, or whether there is some other way to deal with out-of-memory issues on large GWAS/imputed data.
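To make the proposal concrete, here is a sketch of how the 88 GRMs and the multi-GRM REML run might be wired together. It only writes the commands and the GRM list to files so they can be reviewed before anything runs. The output prefixes, the file names multi_GRMs.txt and make_grm_commands.txt, and the phenotype file test.phen are illustrative assumptions. Whether a REML with 88 variance components is sensible is the open question here, and the sketch does not settle it.

```shell
#!/bin/bash
# Write (1) one --make-grm command per SNP list and (2) the list of GRM
# prefixes that --mgrm reads, one prefix per line.
mgrm_file=multi_GRMs.txt
: > "$mgrm_file"

for chr in $(seq 1 22); do
    for grp in 1 2 3 4; do
        out="test_chr${chr}_group${grp}"
        # Command to build one GRM from one SNP list (printed, not executed).
        echo "gcta64 --bfile test --extract chr${chr}_snp_group${grp}.txt --make-grm --thread-num 10 --out ${out}"
        echo "$out" >> "$mgrm_file"
    done
done > make_grm_commands.txt

# REML over all listed GRMs; test.phen is a placeholder phenotype file.
echo "gcta64 --reml --mgrm ${mgrm_file} --pheno test.phen --thread-num 10 --out test_ldms"
```

The lines in make_grm_commands.txt are independent of each other, so they could be spread across separate jobs, each holding only one SNP subset at a time.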

GCTA GREML genome • 3.2k views

score 0 · Answer 1 · 2021-08-31

2.6 years ago

anbai • 0

Did you get any solution for this? I am encountering the same issue here.

Also interested in a potential solution. Please share if at all possible how you managed.

ADD REPLY • link 20 months ago by Diego • 0