Question

Expected number of haplotype IDs per path in PHG

0

Entering edit mode

12 months ago

twrl8 • 0

Hello!

I have recently managed to run the Pathfinding step of the Practical Haplotype Graph pipeline (see here; PHG v1.4).

With this I now wanted to get the calculated best path, i.e. the list of haplotype IDs that are part of that path, so I have written a script that does so (by co-opting the decodePathsForMultipleLists() function of the DBLoadingUtils class). I have tested this script by getting the path for my reference genome and gotten the same number of IDs as there are reference ranges, as expected.

However if I get the IDs for some of my imputed paths I get an incredibly low number of IDs. For example one sample has 66 IDs in it's path, despite there being 422,593 reference ranges. I have also not encountered any errors in the log written out for the run. Is this an issue with the chosen parameters, an error, or are all other ranges simply matching the reference genome?

I would greatly apprechiate any pointers on this.

The parameters I used for the run were:

## DB params....
## other params:
#--- Used for writing a pangenome reference fasta(not needed when inputType=vcf) ---
pangenomeHaplotypeMethod=assembly_by_anchorwave
pangenomeDir=/PHG/outputDir/pangenome
indexKmerLength=21
indexWindowSize=11
indexNumberBases=90G

#--- Used for mapping reads
inputType=fastq
readMethod=lineA_run1
keyFile=/PHG/fullRunConfigs/lineA_run1_readMapping_key_file.txt
fastqDir=/PHG/inputDir/imputation/fastq/
samDir=/PHG/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.25
outputSecondaryStats=false
maxSecondary=20
fParameter=f15000,16000
minimapLocation=minimap2

#--- Used for path finding
pathHaplotypeMethod=assembly_by_anchorwave
pathMethod=lineA_run1
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.0005
numThreads=80
probCorrect=0.99
removeEqual=false
splitNodes=true
splitProb=0.99
usebf=true
minP=0.8


#--- Used to output a vcf file for pathMethod
#~~~ Optional Parameters ~~~
outVcfFile=/PHG/outputDir/align/lineA_run1_variants.vcf
localGVCFFolder=/PHG/inputDir/loadDB/gvcf
debugDir=/PHG/debugDir/lineA_run1/

PHG • 1.0k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 12 months ago by twrl8 • 0

0

Entering edit mode

Without knowing your species or what data has been loaded, it's hard to say, but it could be your parameters. You could try setting your probCorrect to 0.95 and your minReads to 0, run path finding again with a different method name, and see if you get better results for the samples that had few hits.

ADD REPLY • link 12 months ago by lcj34 ▴ 420

0

Entering edit mode

Hi, thank you for the advice.

I am working in barley, so a genome of ~5Gbp length. For the graph I loaded the reference genome and 19 assemblies which were picked for variety. My reference ranges are 10k bp each and cover the whole genome bit by bit, since I want equally good coverage/prediction quality for the path everywhere. I have also not merged using the consenus step, because I want as much possible sequence variety now and only choose later regarding things like minor allele frequency etc.

For my sample input data I have WGS reads with about 30x coverage.

ADD REPLY • link 12 months ago by twrl8 • 0

0

Entering edit mode

Thanks for the information . Let me know if running with the updated parameters gives you better results.

ADD REPLY • link 12 months ago by lcj34 ▴ 420

score 1 · Accepted Answer · 2023-05-30

FYI: After testing with various parameters we determined the problem was with the backward-forward algorithm of the BestHaplotypePathPlugin. The solution is to set the "usebf" parameter to "false", which allows the Viterbi algorithm to be run for the HMM. This resulted in a more representative number of haplotypes. Buckler Lab recommends users keep "usebf" set to false when running this plugin.