Question

Practical Haplotype Graph Consensus and Imputation Specifications

0

Entering edit mode

16 months ago

tkosfeld • 0

Hello,

My group has constructed a successful PHG from a grouping of 26 NAM founders and 11 Teosinte pseudogenomes. However, we're grappling with a lower presence of B73 5.0 (the reference used with the PHG) in our imputed haplotype percentages than what we've been expecting.

Below are some imputation specific questions for the folks at the Buckler lab:

During our consensus haplotype step, none of the ~18,000 gamete_groups of 111046 shared haplotypes included the foundational reference. Is the reference used to construct the PHG not included in consensus haplotype extraction?
Having conducted imputation on a single lines and groups. I've notice the paths returned for a specific line differs on how many other lines are in the imputation keyfile. If submitted together, are the paths for lines not calculated independently?
For the data we are currently imputing, I notice we have the choice between the Viterbi algorithm and forward-backward algorithm, would you have a recommendation between the two?

Thank you for your time.

practical-haplotype-graph phg • 2.4k views

ADD COMMENT • link 14 months ago by tkosfeld • 0

score 2 · Answer 1 · 2022-12-19

Hi Tim,

For the third question, the default is Viterbi. lcj34 (Lynn) has been minimally involved with the imputation pipeline, so her answer was not exactly correct. On the other hand, the answer to question 1 was correct, so I have nothing to add there. Most of our testing and use of imputation has used a faster modification of the Viterbi algorithm. BackwardForward is somewhat slower as it requires two passes over the data. It also allows the user to only impute reference ranges with a minimum probability that the best haplotype was chosen. While the Viterbi algorithm picks the most likely haplotype in each reference range, in some cases the next best is almost as probable. In limited testing, the BackwardForward algorithm with appropriate parameter settings can provide somewhat more accurate imputation, it also imputes fewer sites. So, tradeoff. For question 2, I am puzzled. Each line in a key file is imputed independently, so the results for an individual should not be affected by which other lines are present.

score 1 · Answer 2 · 2022-12-19

1

Entering edit mode

16 months ago

lcj34 ▴ 420

Hi Tim - regarding your first question: The genomes that make up the consensus haplotypes are based on the methods you include when creating consensus. Note that when populating the initial database, the reference has its own method. Be sure to include this method in the list you use when creating consensus. You can check the methods from the methods table .

Regarding the Viterbi vs the BackwardForward algorithm. The default is to NOT use Viterbi as it is significantly slower than the BackwardForward option.

I'm not as familiar with this code as some of my other colleagues so I'll ask them to comment on your other questions.

ADD COMMENT • link 16 months ago by lcj34 ▴ 420

0

Entering edit mode

This and Peter's answers were incredibly helpful Lynn, thank you for your responses.

During our imputation, we're especially interested in imputing from individual and consensus haplotypes simultaneously. However, we encounter the error:

taxon: X represented more than once for reference range:

when specifying pangenomeHaplotypeMethod/pathingHaplotypeMethod as: method2ref:assembly_by_anchorwave:consensus.

Would there be a specific methodology to call from individual and consensus haplotypes, or would it be possible to specify imputation targets by something such as haplotype_id?

Thank you for your time.

ADD REPLY • link 15 months ago by tkosfeld • 0

0

Entering edit mode

Hi Tim -

The limitation is caused by the Graph created and used for imputation. Each node can only be represented once in the graph, and when you have both consensus and individual taxa, you end up with taxa (the nodes) represented multiple times.

It is possible to specify haplotype ids when creating the graph used for imputation, but if the ids include both consensus and individual haplotypes, you may hit the same problem. Is there a reason you want to impute simultaneously vs running the command separately for individuals vs consensus?

ADD REPLY • link 15 months ago by lcj34 ▴ 420

0

Entering edit mode

Hi Lynn -

Quite a bit of our quality control concerns itself with the percentages of haplotypes contributed to a path by each founder. We're hoping the simultaneous calling will give us the clearest percentages.

I did recall there was the ability to specify a list of hap_ids but I was having a bit of trouble finding it again in the documentation. I believe the command was something along the lines of HaplotypeIds=[]?

Thanks, Tim

ADD REPLY • link 15 months ago by tkosfeld • 0

0

Entering edit mode

TIm - When you say you want to know the "percentages of haplotypes contributed to a path by each founder" you're looking for how much each taxa is represented in your paths?

If you ran with only non-consensus haplotypes, the code would make a best guess when there were "ties" between the single-taxon haplotypes (based on previous haplotypes selected). Only one of those individual taxons that were equally likely would be chosen. But if you ran with consensus, you would get all of them and the issue is determining which taxa were represented by each consensus haplotype.

So I think you need to run with consensus only, and then translate the taxa represented at each haplotype. We might have code in rPHG that can handle this. I'll check with our rPHG person to see what he suggests.

ADD REPLY • link 15 months ago by lcj34 ▴ 420

0

Entering edit mode

Hi Lynn,

That is correct, we're quite interested in the ratios between our contributing taxa. I really appreciate you looking into it, but I believe we've figured things out on our end for calculating ratios.

However, consensus pathing has opened up some questions for us about how haplotypes are calculated. In our most recent consensus run, >60% of our imputation path was attributed to consensus groups containing all 37 of our founders.

While I attribute this mostly to parameters, it also led to me noticing a serious deviation between "asm_start_coordinates", "seq_len", and "asm_end_coordinates" in haplotypes at similar positions in different founders. I was under the impression that our reference ranges would at least determine the length of stored haplotypes, but that doesn't seem to be the case for us. Would you consider this normal behavior?

ADD REPLY • link 15 months ago by tkosfeld • 0

0

Entering edit mode

Hi TIm -

If I understand your question correctly, you're wondering why the asm_start_coordinate and asm_end_coordinate fields differ between different genomes? This is expected. Those fields in the db indicate where on the assembly the sequence was aligned to the reference for that haplotype. The haplotypes stored are not MSAs. They are pulled from anchorwave sequencing (or mummer4 if you are using old code). Due to insertions/deletions, we expect the haplotype sizes to be different.

the seq_len should match the (asm_end_coordinate - asm_start_coordinate) + 1 (when on the forward strand). Are you seeing issues with the seq_len presented differing from what is expected based on the start/end coordinates?

ADD REPLY • link 14 months ago by lcj34 ▴ 420

0

Entering edit mode

The accuracy of seq_len and the database as a whole has been excellent, we trust your math for sure!

I think our concern is little more abstract/biologically based. Our initial understanding was that reference ranges would strictly determine the division of genomes into haplotypes so we were a little concerned that we made a mistake when the coordinates and haplotype lengths seemed incongruent with our reference range parameters.

From your answer everything sounds like its working as intended.

ADD REPLY • link 14 months ago by tkosfeld • 0

0

Entering edit mode

Hi Lynn,

I just wanted to give a quick follow up to this chain. Imputation has proceeded much more smoothly with your help, and our main concern currently is assessing our imputed paths for accuracy by checking the percentage of haplotypes presented by each founder.

Currently, we're looking at an unusually high B73 percentage present (>90% counting consensus haplotypes) across our paths. We were wondering if during your work with B73 + NAM you collected and haplotype contribution statistics. We are especially interested in % of your paths that were exclusively B73, and % of paths that were a consensus haplotype containing B73 (and presumably some number of NAM founders).

Thanks for your time as always, Tim

ADD REPLY • link 14 months ago by tkosfeld • 0

0

Entering edit mode

If you're counting consensus and imputation matches both B73 and other genomes, but the ranking file has B73 as the highest, you will get B73 picked ahead of these other genomes. This would account for the high percentage of B73 you're seeing (and yes, we've seen that ourselves).

When we impute, we generally use the consensus - we don't have a means of identifying what matches B73 exclusively vs B73 from consensus. having said that, I'll ask Peter to weigh in as he is more familiar with this part of the code.

ADD REPLY • link 14 months ago by lcj34 ▴ 420

0

Entering edit mode

Thanks for the reminder of the ranking file Lynn, I'll definitely give it a look.

We've been running consensus imputation as well, and we distinguish between exclusively B73 (~7% of our paths on average) and consensus haplotypes that contain B73 (~80% of our paths on average). Would these percentages resemble your work with B73/NAM at all?

ADD REPLY • link 14 months ago by tkosfeld • 0

1

Entering edit mode

The difference between exclusively B73 and consensus that contains B73 may be the result of your mxDiv parameter. If it is set too low, you may be collapsing too much.

One thing you can try is running the AssemblyHapMetricsPlugin and the AssemblyConsensusMetricPlugin.

These plugins print a tab-delimited file with metrics on a per reference range basis. It will show you which taxa is included with each reference range. You can run this against your database from the command line.

To see the parameters required, run TASSEL standalone with just the plugin name:

tassel-5-standalone/run_pipeline.pl -AssemblyConsensusMetricPlugin

tassel-5-standalone/run_pipeline.pl -AssemblyHapMetricPlugin

ADD REPLY • link 14 months ago by lcj34 ▴ 420

1

Entering edit mode

Hi Tim,

The results you are seeing depend a lot on the parameters used to create the consensus haplotypes. It may be that your parameters group more taxa into each consensus than you would like. >60% of reference ranges with only one haplotype sounds too high to me. Our lab has limited experience using NAM consensus haplotypes as most of the analysis has been done with non-consensus haplotypes. We did spend some time a while ago looking at the number of haplotypes created using different parameters.

ADD REPLY • link 14 months ago by pjb39 ▴ 200

0

Entering edit mode

Understood. Thanks for the helpful replies Peter and Lynn. We've been testing mxDiv ranges quite heavily, and I think AssemblyConsensusMetricPlugin/AssemblyHapMetricPlugin will be very helpful for our current analysis.

-Tim

ADD REPLY • link 14 months ago by tkosfeld • 0