parallel for 10000 whole exome data
1
16 months ago
alwayshope ▴ 40

Dear all,

May I know the recommended solutions for massively parallel processing of a very large sample set? Open MPI...?

Thanks a million!

parallel exome • 1.1k views
1

What are you analyzing? What is your hardware? What is your pipeline? There are hundreds of possible answers to this.

0

Thanks a lot! I'm trying to analyze a large WES dataset (10,000 samples or more) on a supercomputer, using a very standard WES pipeline. Thank you very much!

1

for massively parallel processing of a very large sample set

What does that mean?

0

Thanks! I'm trying to analyze a large WES dataset (10,000 samples or more).

0

I'm sorry but "analyze" doesn't mean much more than "process".

0

Sure, thanks a lot!

2
16 months ago
ATpoint 82k

Use a dedicated workflow manager, such as Nextflow or Snakemake. They parallelize jobs along the way and integrate well with containerization solutions such as Docker and Singularity. There are existing workflows for WES data, such as nf-core/sarek. May I ask whether you produced these samples yourself, or whether you plan to download and reanalyze them? 10k samples will require extensive storage and computing resources.
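To illustrate the idea (this is a minimal sketch, not the actual nf-core/sarek pipeline; the file layout, reference path, and samples.txt list are assumptions), here is a Snakemake-style Snakefile in which every sample becomes its own alignment and gVCF-calling job, so the executor can run as many of them concurrently as your allocation allows:

    # Minimal sketch: one alignment + one gVCF-calling job per sample.
    # Paths, reference, and samples.txt are placeholders for illustration.
    SAMPLES = [line.strip() for line in open("samples.txt") if line.strip()]

    rule all:
        input:
            expand("calls/{sample}.g.vcf.gz", sample=SAMPLES)

    rule bwa_map:
        input:
            r1="fastq/{sample}_R1.fastq.gz",
            r2="fastq/{sample}_R2.fastq.gz",
        output:
            "mapped/{sample}.bam"
        threads: 8
        shell:
            "bwa mem -t {threads} ref/genome.fa {input.r1} {input.r2} "
            "| samtools sort -@ {threads} -o {output} -"

    rule call_gvcf:
        input:
            bam="mapped/{sample}.bam"
        output:
            "calls/{sample}.g.vcf.gz"
        threads: 4
        shell:
            "samtools index {input.bam} && "
            "gatk HaplotypeCaller -R ref/genome.fa -I {input.bam} -O {output} -ERC GVCF"

Launched with a cluster executor (e.g. a SLURM profile), the workflow manager keeps submitting per-sample jobs until all 10,000 are done, without you writing any scheduling logic yourself.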

0

Thanks a lot for your guidance!

I'm trying to get the data from UK Biobank (e.g. https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=23153) and do some ML model training before using the in-house data (not yet generated). Yeah, the CRAM files alone for 10,000 WES samples would take nearly 20 TB of storage and huge computing power. Computing efficiency (the core-usage strategy) is also quite important.

2

I wonder whether obtaining VCF files might not do the trick a million times more efficiently (given they provide them, and I guess they do)? In any case, the mentioned workflow managers will offer you full flexibility. You can define processes (like alignment, filtering, whatever), give each the resources you consider optimal, and the workflow manager will take care of parallelization across jobs (along the DAG), maxing out the infrastructure resources you give it. The workflow managers also have caching options to resume the pipeline after failures along the way; all of these are critical features for very large jobs. A rough sketch of what that looks like is below.
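As a rough illustration of that, again in Snakemake syntax (the resource numbers are guesses you would tune per step, and the SLURM profile is assumed to exist): each rule declares what it needs, the executor packs jobs onto the cluster accordingly, and re-running the same command after a crash only redoes missing or incomplete outputs.

    # Hypothetical per-rule resource declaration; the executor schedules jobs accordingly.
    rule call_gvcf:
        input:
            "mapped/{sample}.bam"
        output:
            "calls/{sample}.g.vcf.gz"
        threads: 4
        resources:
            mem_mb=8000,     # illustrative values, tune to your data
            runtime=360      # minutes
        retries: 2           # resubmit jobs killed by transient node failures
        shell:
            "gatk HaplotypeCaller -R ref/genome.fa -I {input} -O {output} -ERC GVCF"

    # Run on a cluster, e.g.:  snakemake --profile slurm --jobs 500
    # Re-invoking the same command after an interruption resumes from what already finished.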

2

I'm almost 100% certain that preprocessed results in some form (like a VCF) already exist for the UKBB.

Unless your project is specifically about improving the processing of raw sequence data to variant calls, I would seriously consider using these. You will save yourself months of HPC time.

I'd guess each sample would take on the order of multiple CPU-days, so you are probably looking at tens of CPU-years in total. Even with 500 cores working at 100% efficiency, that's still a good part of a month to produce something that already exists.
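To put rough numbers on that (the per-sample cost below is an assumption, not a measurement):

    # Back-of-envelope estimate with assumed numbers
    n_samples = 10_000
    cpu_days_per_sample = 2        # assumed cost of alignment + variant calling for one exome
    cores = 500                    # assumed cores available, fully busy

    total_cpu_days = n_samples * cpu_days_per_sample
    print(total_cpu_days / 365)    # ~55 CPU-years of compute
    print(total_cpu_days / cores)  # ~40 wall-clock days, even at 100% efficiency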

2

VCFs for the ~500k UKBB participants are available: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23141 ;)
