Question: GCTA GREML for WGS with extremely large number of samples and SNPs
6 months ago by
moldach130
McGill, Douglas Mental Health University Institute

I've been trying to use GCTA (a tool for Genome-wide Complex Trait Analysis) for a larger job than usual, and our cluster keeps printing "killed" during the segment-based LD score step, which makes me think it's a memory issue. We don't currently have a job scheduler on that cluster, and I believe the job was killed because it was the largest process running when the machine ran out of memory.

Now I'm running the following script on binary files (78G for the .bed, 1.1G for the .bim and 184K for the .fam) on another cluster with a SLURM scheduler. I've requested 250G of RAM and 48 threads, and the job has already been running for two days, so I'm wondering if there is a more efficient way to do this. The maximum wall time I'm currently allowed is 5 days, so I would need to request more.

#!/bin/bash
#SBATCH --time=5-00:00:00
#SBATCH --mem=250G
#SBATCH --cpus-per-task=48

gcta64 --bfile ./alspac_moms --ld-score-region 200 --thread-num 48 --out alspac_moms

The basic GREML tutorial suggests you can split up the data by chromosome, like so:

gcta64 --bfile test --chr 1 --maf 0.01 --make-grm --out test_chr1 --thread-num 10
gcta64 --bfile test --chr 2 --maf 0.01 --make-grm --out test_chr2 --thread-num 10
...
gcta64 --bfile test --chr 22 --maf 0.01 --make-grm --out test_chr22 --thread-num 10

Is it possible to do something similar for GREML in WGS or imputed data? For example:

gcta64 --bfile test --chr 1 --ld-score-region 200 --out test_chr1
gcta64 --bfile test --chr 2 --ld-score-region 200 --out test_chr2
...
lds_seg_chr1 = read.table("test_chr1.score.ld",header=T,colClasses=c("character",rep("numeric",8)))
lds_seg_chr2 = read.table("test_chr2.score.ld",header=T,colClasses=c("character",rep("numeric",8)))
...
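
One way to run the per-chromosome LD score step in parallel would be a SLURM job array with one task per chromosome. This is a minimal sketch, assuming per-chromosome --ld-score-region runs are acceptable for the analysis (which is exactly the open question here). The helper name ld_score_cmd and the memory and CPU requests are hypothetical, and only gcta64 flags already shown above are used:

```shell
#!/bin/bash
#SBATCH --array=1-22
#SBATCH --time=5-00:00:00
#SBATCH --mem=32G             # assumption: per-chromosome memory, tune to your data
#SBATCH --cpus-per-task=8     # assumption: fewer threads per task than the single big job

# Build the gcta64 command line for one chromosome (hypothetical helper).
ld_score_cmd() {
  local chr=$1 threads=${2:-8}
  echo "gcta64 --bfile test --chr ${chr} --ld-score-region 200 --thread-num ${threads} --out test_chr${chr}"
}

# Under SLURM, each array task handles the chromosome given by SLURM_ARRAY_TASK_ID.
# Outside SLURM the variable is unset and nothing is executed.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  cmd=$(ld_score_cmd "$SLURM_ARRAY_TASK_ID" "${SLURM_CPUS_PER_TASK:-8}")
  echo "Running: $cmd"
  eval "$cmd"
fi
```

Each task would write its own test_chr<N>.score.ld, so the 22 tasks do not share output files and each one needs memory for a single chromosome only.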

That way I would stratify SNPs by segment-based LD score within each chromosome and then make a GRM for each of these groups:

chr1_snp_group1.txt
chr1_snp_group2.txt
chr1_snp_group3.txt
chr1_snp_group4.txt
...
chr22_snp_group1.txt
chr22_snp_group2.txt
chr22_snp_group3.txt
chr22_snp_group4.txt
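
The per-chromosome SNP lists above could be produced along these lines. This is a rough sketch, assuming each .score.ld file has a header row with columns named SNP and ldscore_SNP. The function name stratify_by_ldscore is hypothetical, and the split is by rank into four equal-sized groups, which can differ slightly at the boundaries from cutting at R's summary() quartiles:

```shell
#!/bin/bash
# Split one .score.ld file into four SNP lists by rank of the per-SNP LD score.
# Assumes a header row containing columns named SNP and ldscore_SNP.
# Usage (hypothetical): stratify_by_ldscore test_chr1.score.ld chr1
stratify_by_ldscore() {
  local infile=$1 prefix=$2
  awk '
    NR == 1 {
      # Locate the two columns by header name rather than by position.
      for (i = 1; i <= NF; i++) {
        if ($i == "SNP") snp_col = i
        if ($i == "ldscore_SNP") ld_col = i
      }
      if (!snp_col || !ld_col) {
        print "missing SNP or ldscore_SNP column" > "/dev/stderr"
        exit 1
      }
      next
    }
    { print $ld_col, $snp_col }
  ' "$infile" |
  LC_ALL=C sort -g -k1,1 |
  awk -v prefix="$prefix" '
    { snp[NR] = $2 }
    END {
      # Rank i of NR goes to group 1..4, lowest LD scores first.
      for (i = 1; i <= NR; i++) {
        group = int((i - 1) * 4 / NR) + 1
        print snp[i] > (prefix "_snp_group" group ".txt")
      }
    }
  '
}
```

Calling it with the prefix chr1 writes chr1_snp_group1.txt through chr1_snp_group4.txt, one SNP ID per line, with group 1 holding the lowest LD scores.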

And then perform the REML analysis on those 88 GRMs (22 chromosomes x 4 LD score groups)? I'm wondering whether that's a valid approach, or whether there's some other way to deal with out-of-memory issues on large WGS/imputed data.
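
For concreteness, the last two steps might look like the sketch below, which only prints the gcta64 commands rather than running them. It assumes the SNP lists are named as above and uses GCTA's --extract, --make-grm, --reml, --mgrm and --pheno options. The GRM prefix pattern, the output name test_ldms, the phenotype file phen.txt and the helper names are all hypothetical. It does not settle whether 88 variance components is a statistically sensible model:

```shell
#!/bin/bash
# Write the file that --mgrm reads: one GRM prefix per line.
# The prefix pattern test_chr<chr>_group<g> is a hypothetical naming choice.
write_mgrm_list() {
  local outfile=$1 n_groups=${2:-4}
  : > "$outfile"
  for chr in $(seq 1 22); do
    for g in $(seq 1 "$n_groups"); do
      echo "test_chr${chr}_group${g}" >> "$outfile"
    done
  done
}

# Print (not run) the command that builds one GRM from one SNP list.
grm_cmd() {
  local chr=$1 g=$2
  echo "gcta64 --bfile test --extract chr${chr}_snp_group${g}.txt --make-grm --thread-num 10 --out test_chr${chr}_group${g}"
}

# Print (not run) the REML command that fits all listed GRMs jointly.
# phen.txt is a placeholder phenotype file name.
reml_cmd() {
  local mgrm_file=$1
  echo "gcta64 --reml --mgrm ${mgrm_file} --pheno phen.txt --thread-num 10 --out test_ldms"
}
```

Each GRM is an n-by-n matrix over the samples, so fitting many of them jointly is itself memory-intensive, independent of how the LD score step is split.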

gcta greml genome • 267 views