Question

Benefit of running pathfinding for all samples together or individually?

13 months ago

twrl8 • 0

Hello,

Thank you for your continuous help so far!

I was wondering whether there is a benefit to running the pathfinding through the PHG for all of my samples together, or whether it does not matter much.

As an example, suppose I have 15 sets of WGS data (fastq) from 15 different lines and want to run the -ImputePipelinePlugin -imputeTarget path step (or even pathToVCF). I would like to use minimap2 to do the alignments against the pangenome (as implemented in the default pipeline), so inputType would be fastq. Is there anything in the pathfinding step itself that speaks against doing this in, e.g., three separate "jobs" of 5 alignments and paths each, i.e. with 3 different keyfiles listing the corresponding fastq file locations?

Or does the imputation of one path help with that of the next one?

(Running PHG v1.2 if it matters)

Thank you again and all the best!

phg • 790 views

score 1 · Accepted Answer · 2023-03-09

13 months ago

zrm22 ▴ 40

Hello,

There should be nothing stopping you from splitting it up into 3 jobs, other than needing to manage 3 jobs instead of just one. You can also run some samples through with a given Path method and then run more at a later date with the same method name if you need to.
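Splitting one keyfile into several smaller ones can be scripted. The following is only a minimal sketch, not part of the PHG: it assumes the keyfile is a tab-delimited text file with a single header row and one row per sample, and the function name `split_keyfile` and the output file naming are made up for illustration.

```python
import csv
from pathlib import Path


def split_keyfile(keyfile, n_jobs, out_dir):
    """Split a tab-delimited keyfile (one header row) into up to n_jobs keyfiles.

    Assumption: each data row describes one sample, so rows can be
    distributed across jobs independently of each other.
    """
    keyfile = Path(keyfile)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with keyfile.open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines

    written = []
    for job in range(n_jobs):
        chunk = rows[job::n_jobs]  # round-robin assignment of rows to jobs
        if not chunk:
            continue  # fewer rows than jobs: do not write an empty keyfile
        out_path = out_dir / f"{keyfile.stem}_job{job + 1}{keyfile.suffix}"
        with out_path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(header)  # every sub-keyfile keeps the header
            writer.writerows(chunk)
        written.append(out_path)
    return written
```

With 15 data rows and `n_jobs=3`, a call such as `split_keyfile("keyfile.txt", 3, "jobs")` (hypothetical paths) writes three keyfiles of 5 rows each, and each can then be passed to its own imputation run.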

Between separate runs of Imputation, nothing is done to the Emission or Transition Probabilities within the ReadMappings or the Graph, so the imputation of one path is independent of any future paths. We have discussed making a pipeline where you could update the Transition Probabilities using a set of Imputed Paths, but we do not have this implemented.

Hello,

Thank you again for your help!

This is what I wanted to know: whether the paths are independent. Thank you for clarifying.

I did try it, but wanted to make sure it does not affect the results in the end. What I noticed is that I cannot run it as multiple processes at once, since it seems that only one process at a time can access the database (which makes sense, I suppose). Even running the jobs consecutively should help me, though.

(For information, I am working on a cluster with each node accessing the same data storage. So starting the process on multiple nodes would help me use the resources better.)

13 months ago by twrl8

If you are running with an SQLite DB, I believe it does lock you into one connection at a time. The PathFinding portion of the PHG after you have the alignments is actually fairly fast and multithreaded, so it should not take a ton of time even if using a single machine.
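The locking can be reproduced outside the PHG with plain SQLite. The sketch below uses only Python's standard `sqlite3` module and shows general SQLite behaviour (one writer at a time per database file), not anything PHG-specific; the table name `demo` and the function name are made up for illustration.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Return True if a second connection cannot begin a write transaction
    while the first connection still holds one on the same SQLite file."""
    # isolation_level=None: transactions are controlled explicitly with BEGIN.
    first = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0: fail immediately instead of waiting for the lock to clear.
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT INTO demo VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
        except sqlite3.OperationalError:
            return True  # SQLite reported that the database is locked
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_blocked(os.path.join(tmp, "demo.db")))
```

A process that hits this error has to wait and retry, which is consistent with jobs against the same SQLite file needing to run one after another.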

The alignment step of the imputation does take some time, and we have been working on new strategies to deploy it better on cluster systems so that you can use multiple nodes, but this is still in the testing phase. We will likely add it to a future release of the PHG.

13 months ago by zrm22

Thank you again for the answer!

Would multiple connections be possible using a "postgres" database?

13 months ago by twrl8