Practical Haplotype Graph Consensus and Imputation Specifications
8 weeks ago
tkosfeld • 0

Hello,

My group has constructed a successful PHG from a grouping of 26 NAM founders and 11 Teosinte pseudogenomes. However, we're grappling with a lower presence of B73 5.0 (the reference used with the PHG) in our imputed haplotype percentages than what we've been expecting.

Below are some imputation-specific questions for the folks at the Buckler lab:

1. During our consensus haplotype step, none of the ~18,000 gamete_groups of 111046 shared haplotypes included the foundational reference. Is the reference used to construct the PHG not included in consensus haplotype extraction?
2. Having conducted imputation on single lines and on groups, I've noticed that the paths returned for a specific line differ depending on how many other lines are in the imputation keyfile. If submitted together, are the paths for lines not calculated independently?
3. For the data we are currently imputing, I notice we have the choice between the Viterbi algorithm and the forward-backward algorithm. Would you have a recommendation between the two?

practical-haplotype-graph phg • 861 views
1
7 weeks ago
lcj34 ▴ 120

Hi Tim - regarding your first question: The genomes that make up the consensus haplotypes are based on the methods you include when creating consensus. Note that when populating the initial database, the reference has its own method. Be sure to include this method in the list you use when creating consensus. You can check the methods from the methods table.

Regarding Viterbi vs. the BackwardForward algorithm: the default is to NOT use Viterbi, as it is significantly slower than the BackwardForward option.

I'm not as familiar with this code as some of my other colleagues so I'll ask them to comment on your other questions.

0

During our imputation, we're especially interested in imputing from individual and consensus haplotypes simultaneously. However, we encounter the error:

taxon: X represented more than once for reference range:

when specifying pangenomeHaplotypeMethod/pathingHaplotypeMethod as: method2ref:assembly_by_anchorwave:consensus.

Would there be a specific methodology to call from individual and consensus haplotypes, or would it be possible to specify imputation targets by something such as haplotype_id?

0

Hi Tim -

The limitation is caused by the Graph created and used for imputation. Each node can only be represented once in the graph, and when you have both consensus and individual taxa, you end up with taxa (the nodes) represented multiple times.

It is possible to specify haplotype ids when creating the graph used for imputation, but if the ids include both consensus and individual haplotypes, you may hit the same problem. Is there a reason you want to impute simultaneously vs running the command separately for individuals vs consensus?
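To make the duplicate-taxon condition concrete, here is a minimal Python sketch, not PHG code. It assumes haplotype records shaped as hypothetical `(ref_range_id, hap_id, taxa)` tuples, and the function name `find_duplicate_taxa` is invented for illustration. It reports every taxon that appears in more than one haplotype within the same reference range, which is the situation that arises when individual and consensus haplotypes are loaded together.

```python
from collections import defaultdict

def find_duplicate_taxa(haplotypes):
    """Return {ref_range_id: {taxon: [hap_ids]}} for taxa carried by
    more than one haplotype in the same reference range."""
    seen = defaultdict(lambda: defaultdict(list))
    for ref_range_id, hap_id, taxa in haplotypes:
        for taxon in taxa:
            seen[ref_range_id][taxon].append(hap_id)
    duplicates = {}
    for ref_range_id, by_taxon in seen.items():
        repeated = {t: ids for t, ids in by_taxon.items() if len(ids) > 1}
        if repeated:
            duplicates[ref_range_id] = repeated
    return duplicates

# Hypothetical records: (ref_range_id, hap_id, taxa in that haplotype)
haplotypes = [
    (1, 101, ["B73"]),          # individual haplotype
    (1, 102, ["Ki11"]),         # individual haplotype
    (1, 201, ["B73", "Ki11"]),  # consensus haplotype covering both taxa
    (2, 103, ["B73"]),          # only one haplotype, so no conflict
]
duplicates = find_duplicate_taxa(haplotypes)
print(duplicates)
```

In this toy input, reference range 1 is flagged because B73 and Ki11 each appear in both an individual and a consensus haplotype, while range 2 is clean.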

0

Hi Lynn -

Quite a bit of our quality control concerns itself with the percentages of haplotypes contributed to a path by each founder. We're hoping the simultaneous calling will give us the clearest percentages.

I did recall there was the ability to specify a list of hap_ids but I was having a bit of trouble finding it again in the documentation. I believe the command was something along the lines of HaplotypeIds=[]?

Thanks, Tim

0

Tim - When you say you want to know the "percentages of haplotypes contributed to a path by each founder", are you looking for how much each taxon is represented in your paths?

If you ran with only non-consensus haplotypes, the code would make a best guess when there were "ties" between the single-taxon haplotypes (based on previous haplotypes selected). Only one of the equally likely individual taxa would be chosen. But if you ran with consensus, you would get all of them, and the issue becomes determining which taxa were represented by each consensus haplotype.

So I think you need to run with consensus only, and then translate the taxa represented at each haplotype. We might have code in rPHG that can handle this. I'll check with our rPHG person to see what he suggests.
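As a rough illustration of the "translate the taxa represented at each haplotype" step, the Python sketch below tallies how often each taxon appears along a consensus path. It is not rPHG or PHG code: the `path` list, the `hap_to_taxa` mapping, and the function `taxon_contributions` are hypothetical stand-ins for whatever the database or rPHG returns. Two conventions are shown, since the right one depends on the QC question: counting a taxon fully whenever it is in the chosen consensus group, or splitting each reference range evenly among the taxa in the group.

```python
from collections import Counter

def taxon_contributions(path, hap_to_taxa, split_evenly=False):
    """Fraction of reference ranges in `path` attributable to each taxon.

    split_evenly=False: a taxon gets full credit for every range whose
    chosen haplotype contains it (fractions can sum to more than 1).
    split_evenly=True: each range's single unit of credit is divided
    equally among the taxa in the chosen haplotype (fractions sum to 1).
    """
    credit = Counter()
    for hap_id in path:
        taxa = hap_to_taxa[hap_id]
        weight = 1.0 / len(taxa) if split_evenly else 1.0
        for taxon in taxa:
            credit[taxon] += weight
    return {taxon: value / len(path) for taxon, value in credit.items()}

# Hypothetical path: one chosen haplotype id per reference range
path = [201, 102, 201, 103]
hap_to_taxa = {
    201: ["B73", "Ki11"],  # consensus haplotype
    102: ["Ki11"],
    103: ["B73"],
}
presence = taxon_contributions(path, hap_to_taxa)
shared = taxon_contributions(path, hap_to_taxa, split_evenly=True)
print(presence)
print(shared)
```

With full credit, a consensus group containing every founder contributes equally to all of them, so a path dominated by such groups will show near-identical percentages across taxa.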

0

Hi Lynn,

That is correct, we're quite interested in the ratios between our contributing taxa. I really appreciate you looking into it, but I believe we've figured things out on our end for calculating ratios.

However, consensus pathing has opened up some questions for us about how haplotypes are calculated. In our most recent consensus run, >60% of our imputation path was attributed to consensus groups containing all 37 of our founders.

While I attribute this mostly to parameters, it also led to me noticing a serious deviation between "asm_start_coordinates", "seq_len", and "asm_end_coordinates" in haplotypes at similar positions in different founders. I was under the impression that our reference ranges would at least determine the length of stored haplotypes, but that doesn't seem to be the case for us. Would you consider this normal behavior?

0

Hi Tim -

If I understand your question correctly, you're wondering why the asm_start_coordinate and asm_end_coordinate fields differ between different genomes? This is expected. Those fields in the db indicate where on the assembly the sequence was aligned to the reference for that haplotype. The haplotypes stored are not MSAs. They are pulled from anchorwave alignments (or mummer4 if you are using old code). Due to insertions/deletions, we expect the haplotype sizes to be different.

The seq_len should match (asm_end_coordinate - asm_start_coordinate) + 1 (when on the forward strand). Are you seeing issues with the seq_len presented differing from what is expected based on the start/end coordinates?
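The length relationship above can be checked with a few lines of Python. This is only a sketch: the rows are made-up tuples of `(hap_id, asm_start_coordinate, asm_end_coordinate, seq_len)`, the helper names are hypothetical, and taking the absolute difference to cover reverse-strand records (where start may exceed end) is an assumption rather than something stated in this thread.

```python
def expected_seq_len(asm_start, asm_end):
    """Length implied by inclusive assembly coordinates.

    Forward strand: (end - start) + 1. The abs() is an assumption so that
    records stored with start > end give the same length.
    """
    return abs(asm_end - asm_start) + 1

def find_length_mismatches(rows):
    """Return hap_ids whose stored seq_len disagrees with the coordinates."""
    return [
        hap_id
        for hap_id, asm_start, asm_end, seq_len in rows
        if expected_seq_len(asm_start, asm_end) != seq_len
    ]

# Hypothetical rows: (hap_id, asm_start, asm_end, seq_len)
rows = [
    (101, 1000, 1499, 500),
    (102, 2000, 2749, 750),
    (103, 5000, 5099, 90),  # deliberately inconsistent for the demo
]
mismatches = find_length_mismatches(rows)
print(mismatches)
```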

0

The accuracy of seq_len and the database as a whole has been excellent; we trust your math for sure!

I think our concern is a little more abstract and biologically based. Our initial understanding was that reference ranges would strictly determine the division of genomes into haplotypes, so we were a little concerned that we had made a mistake when the coordinates and haplotype lengths seemed incongruent with our reference range parameters.

1
7 weeks ago
pjb39 ▴ 80

Hi Tim,

For the third question, the default is Viterbi. lcj34 (Lynn) has been minimally involved with the imputation pipeline, so her answer was not exactly correct. On the other hand, her answer to question 1 was correct, so I have nothing to add there.

Most of our testing and use of imputation has used a faster modification of the Viterbi algorithm. BackwardForward is somewhat slower, as it requires two passes over the data. It also allows the user to impute only those reference ranges where the probability that the best haplotype was chosen meets a minimum. While the Viterbi algorithm picks the most likely haplotype in each reference range, in some cases the next best is almost as probable. In limited testing, the BackwardForward algorithm with appropriate parameter settings can provide somewhat more accurate imputation, but it also imputes fewer sites. So, there is a tradeoff.

For question 2, I am puzzled. Each line in a key file is imputed independently, so the results for an individual should not be affected by which other lines are present.
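The tradeoff between the two algorithms can be seen on a toy hidden Markov model. The Python sketch below is not the PHG implementation: the two-haplotype model, the emission table `emis`, the `p_stay` transition parameter, the `min_prob` cutoff, and all function names are assumptions chosen for illustration. Viterbi returns one best path and always makes a call, while the forward-backward pass yields a per-range posterior that can be thresholded, so low-confidence ranges are left unimputed.

```python
def transition(prev, cur, n_states, p_stay):
    """Stay on the same haplotype with p_stay, else switch uniformly."""
    return p_stay if prev == cur else (1.0 - p_stay) / (n_states - 1)

def viterbi(emis, p_stay):
    """Most probable state path (single forward pass plus traceback)."""
    n = len(emis[0])
    score = [e / n for e in emis[0]]  # uniform start probabilities
    pointers = []
    for obs in emis[1:]:
        new_score, ptr = [], []
        for k in range(n):
            best = max(range(n), key=lambda j: score[j] * transition(j, k, n, p_stay))
            ptr.append(best)
            new_score.append(score[best] * transition(best, k, n, p_stay) * obs[k])
        score = new_score
        pointers.append(ptr)
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def forward_backward(emis, p_stay):
    """Posterior probability of each state at each range (two passes).
    No rescaling is done, so this is only suitable for short toy inputs."""
    n, length = len(emis[0]), len(emis)
    fwd = [[e / n for e in emis[0]]]
    for t in range(1, length):
        prev = fwd[-1]
        fwd.append([emis[t][k] * sum(prev[j] * transition(j, k, n, p_stay)
                                     for j in range(n)) for k in range(n)])
    bwd = [[1.0] * n]
    for t in range(length - 2, -1, -1):
        nxt = bwd[0]
        bwd.insert(0, [sum(transition(k, j, n, p_stay) * emis[t + 1][j] * nxt[j]
                           for j in range(n)) for k in range(n)])
    posteriors = []
    for f, b in zip(fwd, bwd):
        raw = [x * y for x, y in zip(f, b)]
        total = sum(raw)
        posteriors.append([r / total for r in raw])
    return posteriors

def call_with_threshold(posteriors, min_prob):
    """Best state per range, or None when its posterior is below min_prob."""
    calls = []
    for post in posteriors:
        best = max(range(len(post)), key=lambda k: post[k])
        calls.append(best if post[best] >= min_prob else None)
    return calls

# Toy data: emis[t][k] = P(reads at range t | haplotype k)
emis = [[0.95, 0.05], [0.6, 0.4], [0.05, 0.95]]
viterbi_path = viterbi(emis, p_stay=0.9)
posteriors = forward_backward(emis, p_stay=0.9)
calls = call_with_threshold(posteriors, min_prob=0.75)
print(viterbi_path)
print(calls)
```

In this example the middle range is ambiguous: Viterbi still assigns it a haplotype, whereas the thresholded forward-backward result leaves it uncalled, which mirrors the "more accurate but fewer sites" behavior described above.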

0

Peter - thanks for responding and clarifying for Tim. Tim - Peter is the guru of this code, so believe everything he says!