Using QCTOOL v2 to process UK Biobank .bgen files - why so slow?
11 months ago

I’m currently using QCTOOL v2 to process imputed .bgen files from UK Biobank; however, they seem to be processing very slowly. Is this normal?

My command is pretty basic; I’m filtering out a list of SNPs and samples:

/path_to/qctool \
-g /path_to/ukbXXXX_c22_b0_v3.bgen \
-s /path_to/ukbXXXX_c22_b0_v3.sample \
-og /output_path/ukb_c22_filt.bgen \
-os /output_path/ukb_c22_filt.sample \
-excl-rsids /path_to/snps_rem_c22.txt \
-excl-samples /path_to/samples_to_rem.txt


It is currently processing SNPs at a rate of 1.2/s (e.g. 71169/?,57205.8s,1.2/s). The computational facility I'm using should not be the speed limiter, so have I made any mistakes in my qctool command? Or are there other reasons it might be running so slowly? Alternative suggestions on how to process these files would also be appreciated (I would prefer to use QCTOOL, but realise I may have to use PLINK).

ukbiobank plink qctool

Hi! I'm installing qctool now, but it failed at compilation. Did you run into the same issue? If so, how did you handle it? Let me know, thanks!


Compiled binaries seem to be available in this directory: https://www.well.ox.ac.uk/~gav/resources/


Thank you so much!! I visited the directory and tried to find the binary that suits my system, but failed. However, I got version 2.0.7 last night and finished compiling it. Happy!

9 months ago

I contacted the QCTOOL team directly and received this response:

The design of BGEN means that when you subset samples the data has to be recompressed - this is essentially what makes this slow. (By contrast you can subset SNPs very quickly without recompression using bgenix https://code.enkre.net/bgen.) It is therefore definitely worth considering not subsetting samples but using a sample inclusion/exclusion list instead, if your analysis software supports that.

If you have to subset and want to use QCTOOL, some things to try are:

- I typically use the options -bgen-compression zstd -bgen-bits 8 now (c.f. https://doi.org/10.1101/308296); this is faster, but first check that your downstream software supports zstd compression.
- Use a map/reduce-type pipeline (i.e. chunk the data for re-encoding); this can be implemented using bgenix and cat-bgen.
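To make the two suggestions concrete, here is a rough sketch (file names are taken from the question; the chunk boundaries are hypothetical, and you should verify that your qctool, bgenix, and cat-bgen versions support these options):

```shell
# 1) Re-encode with zstd compression and 8-bit probabilities while
#    excluding samples, assuming a qctool v2 build with zstd support:
qctool \
    -g ukbXXXX_c22_b0_v3.bgen \
    -s ukbXXXX_c22_b0_v3.sample \
    -excl-samples samples_to_rem.txt \
    -og ukb_c22_filt.bgen \
    -bgen-compression zstd \
    -bgen-bits 8

# 2) Map/reduce-style chunking: slice the file by range with bgenix
#    (needs a .bgen.bgi index), re-encode each chunk (possibly in
#    parallel), then join the results with cat-bgen.
bgenix -g ukbXXXX_c22_b0_v3.bgen -index
bgenix -g ukbXXXX_c22_b0_v3.bgen -incl-range 22:0-25000000 > chunk1.bgen
bgenix -g ukbXXXX_c22_b0_v3.bgen -incl-range 22:25000001-50000000 > chunk2.bgen
# ...run qctool on each chunk, producing chunk1_filt.bgen etc., then:
cat-bgen -g chunk1_filt.bgen chunk2_filt.bgen -og ukb_c22_filt.bgen
```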

Have you tried first stripping the SNPs, then stripping the samples?

Ever since working with the imputed UKB data, I have never been successful in chopping pieces out of bgen files with qctool in a timely manner. If my memory serves me well, qctool strips SNPs fairly quickly, but is super slow at removing samples. Try PLINK 1.9 or 2.0.
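For reference, a possible PLINK 2.0 equivalent of the qctool command in the question (file names from the question; check the flags against your plink2 version, and note that UKB bgen files store the reference allele first, hence ref-first):

```shell
# Filter out the listed SNPs and samples in one pass and write
# bgen v1.2 output with 8-bit probabilities.
plink2 \
    --bgen ukbXXXX_c22_b0_v3.bgen ref-first \
    --sample ukbXXXX_c22_b0_v3.sample \
    --exclude snps_rem_c22.txt \
    --remove samples_to_rem.txt \
    --export bgen-1.2 bits=8 \
    --out ukb_c22_filt
```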

As for “The computational facility I'm using should not limit the speed of an operation”: depending on how disk data moves in and out of the compute nodes and how busy the cluster is, I have seen processes become I/O-starved on large clusters.

Also, consider that most tools that perform association testing can take SNP and sample lists (BOLT, SAIGE, regenie, SNPTEST…), so there might not be a need to pre-filter. Set the phenotype to “NA” for samples you’d like to be “removed” from the testing.
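The NA-masking step can be done directly on the phenotype file. A minimal sketch, assuming a whitespace-delimited file whose first column is the sample ID and whose third column is the phenotype (the column layout and file names here are illustrative stand-ins, not the real UKB files):

```shell
# Demo inputs standing in for the real phenotype file and removal list:
printf 'id1 1 0.5\nid2 1 0.7\nid3 0 0.9\n' > pheno.txt
printf 'id2\n' > samples_to_rem.txt

# Set the phenotype (column 3) to NA for every sample listed in
# samples_to_rem.txt; all other rows pass through unchanged.
awk 'NR==FNR { drop[$1]; next } ($1 in drop) { $3 = "NA" } 1' \
    samples_to_rem.txt pheno.txt > pheno_masked.txt

cat pheno_masked.txt
# id1 1 0.5
# id2 1 NA
# id3 0 0.9
```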

In the end I did not pre-filter SNPs or samples: I set samples to NA within the phenotype file, and used a SNP inclusion list in the second stage of SAIGE with the --idstoIncludeFile flag.