Hello,
My group has constructed a successful PHG from a grouping of 26 NAM founders and 11 Teosinte pseudogenomes. However, we're grappling with a lower presence of B73 5.0 (the reference used with the PHG) in our imputed haplotype percentages than what we've been expecting.
Below are some imputation-specific questions for the folks at the Buckler lab:
- During our consensus haplotype step, none of the ~18,000 gamete_groups of 111046 shared haplotypes included the foundational reference. Is the reference used to construct the PHG not included in consensus haplotype extraction?
- Having run imputation on single lines and on groups of lines, I've noticed that the paths returned for a specific line differ depending on how many other lines are in the imputation keyfile. If lines are submitted together, are their paths not calculated independently?
- For the data we are currently imputing, we have the choice between the Viterbi algorithm and the forward-backward algorithm. Would you recommend one over the other?
Thank you for your time.
This and Peter's answers were incredibly helpful, Lynn. Thank you for your responses.
During our imputation, we're especially interested in imputing from individual and consensus haplotypes simultaneously. However, we encounter the error:
taxon: X represented more than once for reference range:
when specifying pangenomeHaplotypeMethod/pathingHaplotypeMethod as: method2ref:assembly_by_anchorwave:consensus.
Would there be a specific methodology to call from individual and consensus haplotypes, or would it be possible to specify imputation targets by something such as haplotype_id?
Thank you for your time.
Hi Tim -
The limitation comes from the graph that is created and used for imputation. Each taxon can be represented only once per reference range in the graph, and when you include both consensus and individual haplotypes, the same taxa end up represented multiple times.
It is possible to specify haplotype ids when creating the graph used for imputation, but if the ids include both consensus and individual haplotypes, you may hit the same problem. Is there a reason you want to impute simultaneously vs running the command separately for individuals vs consensus?
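For anyone following along, here is a minimal Python sketch of why mixing the two haplotype sets trips that check. It is not PHG code: the `(ref_range_id, hap_id, taxa)` tuples, the ids, and the function name are hypothetical stand-ins for whatever you export from your own database.

```python
from collections import defaultdict

def find_duplicate_taxa(haplotypes):
    """Return {ref_range_id: {taxon: [hap_id, ...]}} for every taxon that
    appears in more than one haplotype within the same reference range.

    `haplotypes` is an iterable of (ref_range_id, hap_id, taxa) tuples.
    This layout is an assumption for illustration, not the PHG schema.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for ref_range_id, hap_id, taxa in haplotypes:
        for taxon in taxa:
            seen[ref_range_id][taxon].append(hap_id)

    duplicates = {}
    for ref_range_id, by_taxon in seen.items():
        dup = {t: ids for t, ids in by_taxon.items() if len(ids) > 1}
        if dup:
            duplicates[ref_range_id] = dup
    return duplicates

# Hypothetical example: range 1 holds two individual haplotypes plus a
# consensus haplotype (hap_id 201) built from the same two taxa.
haplotypes = [
    (1, 101, ["B73"]),
    (1, 102, ["Ki11"]),
    (1, 201, ["B73", "Ki11"]),
    (2, 103, ["B73"]),
]
print(find_duplicate_taxa(haplotypes))
```

Running a check like this on a candidate list of haplotype ids before building the graph should show which reference ranges would trigger the "represented more than once" error.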
Hi Lynn -
Quite a bit of our quality control concerns itself with the percentages of haplotypes contributed to a path by each founder. We're hoping the simultaneous calling will give us the clearest percentages.
I did recall there was the ability to specify a list of hap_ids but I was having a bit of trouble finding it again in the documentation. I believe the command was something along the lines of
HaplotypeIds=[]?
Thanks, Tim
Tim - When you say you want to know the "percentages of haplotypes contributed to a path by each founder", you're looking for how much each taxon is represented in your paths?
If you ran with only non-consensus haplotypes, the code would make a best guess when there were "ties" between the single-taxon haplotypes (based on previous haplotypes selected). Only one of the equally likely individual taxa would be chosen. But if you ran with consensus, you would get all of them, and the issue becomes determining which taxa are represented by each consensus haplotype.
So I think you need to run with consensus only, and then translate the taxa represented at each haplotype. We might have code in rPHG that can handle this. I'll check with our rPHG person to see what he suggests.
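In the meantime, the translation step could look roughly like the sketch below. This is plain Python, not rPHG, and it assumes you have already extracted, for each reference range on an imputed path, the list of taxa behind the chosen consensus haplotype. Splitting each range's weight equally among its taxa is one possible convention, chosen here only for illustration; counting presence per range is another.

```python
from collections import defaultdict

def founder_fractions(path):
    """Estimate each founder's share of an imputed path.

    `path` is a list with one entry per reference range; each entry is the
    list of taxa represented by the haplotype chosen at that range.
    Each range carries weight 1, split equally among its taxa
    (an illustrative assumption, not a PHG or rPHG convention).
    """
    totals = defaultdict(float)
    for taxa in path:
        if not taxa:
            continue  # skip ranges with no haplotype on the path
        share = 1.0 / len(taxa)
        for taxon in taxa:
            totals[taxon] += share

    weight = sum(totals.values())
    return {t: v / weight for t, v in totals.items()} if weight else {}

# Hypothetical four-range path.
path = [
    ["B73"],
    ["B73", "Ki11"],
    ["Ki11", "CML247"],
    ["B73", "Ki11", "CML247", "Oh43"],
]
fractions = founder_fractions(path)
for taxon, frac in sorted(fractions.items()):
    print(f"{taxon}: {frac:.4f}")
```

With this convention, a consensus haplotype shared by all founders adds the same small amount to every founder, so ranges where everything collapses into one group do not distort the ratios between founders.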
Hi Lynn,
That is correct, we're quite interested in the ratios between our contributing taxa. I really appreciate you looking into it, but I believe we've figured things out on our end for calculating ratios.
However, consensus pathing has opened up some questions for us about how haplotypes are calculated. In our most recent consensus run, >60% of our imputation path was attributed to consensus groups containing all 37 of our founders.
While I attribute this mostly to parameters, it also led to me noticing a serious deviation between "asm_start_coordinates", "seq_len", and "asm_end_coordinates" in haplotypes at similar positions in different founders. I was under the impression that our reference ranges would at least determine the length of stored haplotypes, but that doesn't seem to be the case for us. Would you consider this normal behavior?
Hi Tim -
If I understand your question correctly, you're wondering why the asm_start_coordinate and asm_end_coordinate fields differ between genomes? This is expected. Those fields in the db indicate where on the assembly the sequence aligned to the reference for that haplotype. The haplotypes stored are not MSAs. They are pulled from anchorwave alignments (or mummer4 if you are using old code). Due to insertions/deletions, we expect the haplotype sizes to be different.
The seq_len should match (asm_end_coordinate - asm_start_coordinate) + 1 when the haplotype is on the forward strand. Are you seeing cases where the reported seq_len differs from what the start/end coordinates imply?
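If it helps, that relationship can be checked with a few lines of Python. This is only a sketch: the dictionary keys mirror the field names used in this thread, the rows are made-up values rather than real database output, and it covers only forward-strand haplotypes.

```python
def expected_seq_len(asm_start, asm_end):
    """Length implied by inclusive, 1-based forward-strand coordinates."""
    return asm_end - asm_start + 1

def mismatched_rows(rows):
    """Return rows whose seq_len disagrees with their coordinates.

    Each row is a dict with `asm_start_coordinate`, `asm_end_coordinate`,
    and `seq_len` keys (an assumed export format, forward strand only).
    """
    return [
        row for row in rows
        if row["seq_len"] != expected_seq_len(
            row["asm_start_coordinate"], row["asm_end_coordinate"]
        )
    ]

# Hypothetical rows: the second one is deliberately inconsistent.
rows = [
    {"hap_id": 1, "asm_start_coordinate": 1000, "asm_end_coordinate": 1999, "seq_len": 1000},
    {"hap_id": 2, "asm_start_coordinate": 5000, "asm_end_coordinate": 5499, "seq_len": 510},
]
print([row["hap_id"] for row in mismatched_rows(rows)])
```

Note that two founders can both pass this check at the same reference range and still have very different lengths, since insertions and deletions change how much assembly sequence aligns to that range.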
The accuracy of seq_len and of the database as a whole has been excellent. We trust your math for sure!
I think our concern is a little more abstract and biologically based. Our initial understanding was that reference ranges would strictly determine how genomes are divided into haplotypes, so we were a little concerned that we had made a mistake when the coordinates and haplotype lengths seemed incongruent with our reference range parameters.
From your answer, everything sounds like it's working as intended.