Question

Clarification on the usage of pangenomeHaplotypeMethod/pathHaplotypeMethod

14 months ago

twrl8 • 0

Hello!

I am currently trying to impute paths through a built Practical Haplotype Graph, i.e. use the -ImputePipelinePlugin -imputeTarget command. The PHG version I use is 1.2. I populated the database using assemblies and the built-in anchorwave plugin. I have fastq files as input for imputation.

I have trouble setting the pangenomeHaplotypeMethod/pathHaplotypeMethod parameters correctly. The error I get says: "CreateGraphUtils: methodId: no method name assembly_by_anchorwave". I do not quite understand the documentation here and here. Are these parameters not user defined?

Or are they perhaps set in a previous step? If so, it might be relevant that I skipped the "Create Consensus Haplotypes" step, because it was marked as optional and I specifically wanted as many versions of each haplotype as the pangenome could contain. However, I cannot find the pangenomeHaplotypeMethod/pathHaplotypeMethod parameters in the documentation of the "Create Consensus Haplotypes" step. Can I find the correct method names in the liquibase database itself? If so, how?

If needed, my first error message:

[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: haplotype method: assembly_by_anchorwave range group method: null
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.api.CreateGraphUtils - CreateGraphUtils: methodId: no method name assembly_by_anchorwave
java.lang.IllegalArgumentException: CreateGraphUtils: methodId: no method name assembly_by_anchorwave
        at net.maizegenetics.pangenome.api.CreateGraphUtils.methodId(CreateGraphUtils.java:1242)
        at net.maizegenetics.pangenome.api.CreateGraphUtils.createHaplotypeNodes(CreateGraphUtils.java:408)
        at net.maizegenetics.pangenome.api.CreateGraphUtils.createHaplotypeNodes(CreateGraphUtils.java:1009)
        at net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin.processData(HaplotypeGraphBuilderPlugin.java:84)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin.runImputationPipeline(ImputePipelinePlugin.kt:191)
        at net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin.processData(ImputePipelinePlugin.kt:151)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:2017)
        at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin - CreateGraphUtils: methodId: Problem getting id for method: assembly_by_anchorwave

And my config file for this step:

# Imputation Pipeline parameters for fastq or SAM files

# Required Parameters!!!!!!!
#--- Database ---
host=localHost
user=xxx
password=xxx
DB=/PHG/phg_run1.db
DBtype=sqlite


#--- Used by liquibase to check DB version ---
liquibaseOutdir=/PHG/outputDir/

#--- Used for writing a pangenome reference fasta(not needed when inputType=vcf) ---
pangenomeHaplotypeMethod=assembly_by_anchorwave
pangenomeDir=/PHG/outputDir/pangenome
indexKmerLength=21
indexWindowSize=11
indexNumberBases=90G

#--- Used for mapping reads
inputType=fastq
readMethod=20230213_run1
keyFile=/PHG/readMapping_key_file.txt
fastqDir=/PHG/inputDir/imputation/fastq/
samDir=/PHG/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.25
outputSecondaryStats=false
maxSecondary=20
fParameter=f15000,16000
minimapLocation=minimap2

#--- Used for path finding
pathHaplotypeMethod=assembly_by_anchorwave
pathMethod=20230213_run1
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.0005
numThreads=4
probCorrect=0.99
removeEqual=false
splitNodes=true
splitProb=0.99
usebf=true
maxParents = 1000000
minCoverage = 1.0
#parentOutputFile = **OPTIONAL**

#   used by haploid path finding only
usebf=true
minP=0.8

#   used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=classic

#--- Used to output a vcf file for pathMethod
outVcfFile=/PHG/outputDir/align/20230213_run1_variants.vcf
#~~~ Optional Parameters ~~~
#pangenomeIndexName=**OPTIONAL**
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
debugDir=/PHG/debugDir/
#bfInfoFile=**OPTIONAL**
localGVCFFolder=/PHG/outputDir/align/gvcfs  # added because demanded by error message

phg • 842 views

14 months ago

lcj34 ▴ 420

You are imputing against an existing database. What methods were used to load the haplotypes? When the pangenome graph is created, it uses a method (or set of methods) that the user provides. These must be existing methods that were used to create the database haplotypes.

Your config file shows you want a pangenome haplotype graph created from the haplotypes that are associated with the method name "assembly_by_anchorwave". If this method doesn't exist in the db, you will see the error printed above.
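To see which method names a SQLite PHG database actually contains, the file can be inspected directly. The following is a minimal Python sketch, not part of PHG: it makes no assumption about PHG's exact schema and simply returns the rows of every table whose name contains "method". The function names are hypothetical, and the database path in the usage comment is the DB= entry from the config file in the question.

```python
import sqlite3


def open_readonly(db_path):
    """Open an existing SQLite file read-only so the PHG db cannot be modified."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def method_like_rows(conn):
    """Return {table_name: [row as dict, ...]} for tables whose name contains 'method'."""
    conn.row_factory = sqlite3.Row
    tables = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name LIKE '%method%'"
        )
    ]
    # Table names are read from sqlite_master, so quoting them is sufficient here.
    return {
        table: [dict(row) for row in conn.execute(f'SELECT * FROM "{table}"')]
        for table in tables
    }


# Usage (path taken from the DB= entry of the config file):
#   conn = open_readonly("/PHG/phg_run1.db")
#   for table, rows in method_like_rows(conn).items():
#       print(table, rows)
#   conn.close()
```

If the name given as pangenomeHaplotypeMethod or pathHaplotypeMethod does not appear in any of the returned rows, the "no method name" error from the question is the expected outcome.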

The impute pipeline will perform read mappings and store the reads into the db against a method that is user defined. This is the read_mapping parameter that is provided to the ImputePipelinePlugin. That also needs to be defined in the config file (or explicitly sent to the ImputePipelinePlugin).

We have plans to update the documentation to make this clearer. I hope this explanation helps.

Hi, thank you very much for your answer. That is along the lines of what I was thinking; however, I cannot determine which parameter in the haplotype loading step defines the method name. To load the haplotypes I used the "AssemblyMAFFromAnchorWavePlugin". Is there a way to read out the method name from the .db file (there should only be one method in my case)?

This is the config file I used for the first two steps (creating the database and loading the haplotypes). It was initially generated via the createDefaultDirectory plugin and I then added the Anchorwave parameters according to here. As there are some default parameters that contain the term 'method', I also tried setting the parameter to "mummer4" instead of "assembly_by_anchorwave", but this did not work either. So is it possible that this method name remained unset without error during the haplotype loading step? If so, is there a way to salvage the anchorwave alignments, as they took quite a while to run on my machine?

Many thanks again!

host=localHost
user=xxx
password=xxx
DB=/PHG/phg_run1.db
DBtype=sqlite

########################################
#Required Parameters:
########################################

AssemblyMAFFromAnchorWavePlugin.outputDir=/PHG/outputDir
AssemblyMAFFromAnchorWavePlugin.keyFile=/PHG/load_asm_genome_key_file.txt
AssemblyMAFFromAnchorWavePlugin.gffFile=/PHG/inputDir/reference/Ref_splitChroms.gff3
AssemblyMAFFromAnchorWavePlugin.refFasta=/PHG/inputDir/reference/Ref_pseudomolecules_assembly.splitChroms.chrUn.fa
AssemblyMAFFromAnchorWavePlugin.threadsPerRun=3
AssemblyMAFFromAnchorWavePlugin.numRuns=4
AssemblyMAFFromAnchorWavePlugin.refMaxAlignCov=1
AssemblyMAFFromAnchorWavePlugin.queryMaxAlignCov=1
HaplotypeGraphBuilderPlugin.configFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.dbConfigFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.refRangeMethods=**UNASSIGNED**
GetDBConnectionPlugin.create=true
GetDBConnectionPlugin.config=/PHG/config.txt
LoadAllIntervalsToPHGdbPlugin.ref=/PHG/inputDir/reference/Ref_pseudomolecules_assembly.splitChroms.chrUn.fa
LoadAllIntervalsToPHGdbPlugin.genomeData=/PHG/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/PHG/outputDir
LoadAllIntervalsToPHGdbPlugin.refServerPath=/PHG/inputDir/reference/
LoadAllIntervalsToPHGdbPlugin.anchors=/PHG/validanchors.bed
LiquibaseUpdatePlugin.outputDir=/PHG/outputDir/
LoadHaplotypesFromGVCFPlugin.wgsKeyFile=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.bedFile=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.gvcfDir=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.referenceFasta=/PHG/inputDir/reference/Ref_pseudomolecules_assembly.splitChroms.chrUn.fa
FilterGVCFSingleFilePlugin.inputGVCFFile=**UNASSIGNED**
FilterGVCFSingleFilePlugin.outputGVCFFile=/PHG/outputDir
FilterGVCFSingleFilePlugin.configFile=**UNASSIGNED**
RunHapConsensusPipelinePlugin.collapseMethod=**UNASSIGNED**
RunHapConsensusPipelinePlugin.dbConfigFile=**UNASSIGNED**
AssemblyHaplotypesMultiThreadPlugin.outputDir=/PHG/outputDir
AssemblyHaplotypesMultiThreadPlugin.keyFile=**UNASSIGNED**
referenceFasta=/PHG/inputDir/reference/Ref_pseudomolecules_assembly.splitChroms.chrUn.fa

########################################
#Defaulted parameters:
########################################
HaplotypeGraphBuilderPlugin.includeSequences=true
HaplotypeGraphBuilderPlugin.includeVariantContexts=false
CreateIntervalBedFilesPlugin.windowSize=1000
CreateIntervalBedFilesPlugin.bedFile=intervals.bed
LoadAllIntervalsToPHGdbPlugin.isTestMethod=false
LoadHaplotypesFromGVCFPlugin.queueSize=30
LoadHaplotypesFromGVCFPlugin.isTestMethod=false
LoadHaplotypesFromGVCFPlugin.mergeRefBlocks=false
LoadHaplotypesFromGVCFPlugin.numThreads=3
LoadHaplotypesFromGVCFPlugin.maxNumHapsStaged=10000
RunHapConsensusPipelinePlugin.minTaxa=1
RunHapConsensusPipelinePlugin.distanceCalculation=Euclidean
RunHapConsensusPipelinePlugin.minFreq=0.5
RunHapConsensusPipelinePlugin.isTestMethod=false
RunHapConsensusPipelinePlugin.minCoverage=0.1
RunHapConsensusPipelinePlugin.mxDiv=0.01
RunHapConsensusPipelinePlugin.clusteringMode=upgma_assembly
RunHapConsensusPipelinePlugin.maxClusters=30
RunHapConsensusPipelinePlugin.minSites=30
RunHapConsensusPipelinePlugin.maxThreads=1000
RunHapConsensusPipelinePlugin.kmerSize=7
AssemblyHaplotypesMultiThreadPlugin.mummer4Path=/mummer/bin/
AssemblyHaplotypesMultiThreadPlugin.loadDB=true
AssemblyHaplotypesMultiThreadPlugin.minInversionLen=7500
AssemblyHaplotypesMultiThreadPlugin.assemblyMethod=mummer4
AssemblyHaplotypesMultiThreadPlugin.entryPoint=all
AssemblyHaplotypesMultiThreadPlugin.isTestMethod=false
AssemblyHaplotypesMultiThreadPlugin.numThreads=3
AssemblyHaplotypesMultiThreadPlugin.clusterSize=250
numThreads=20
Xmx=750G
picardPath=/picard.jar
gatkPath=/gatk/gatk
tasselLocation=/tassel-5-standalone/run_pipeline.pl
fastqFileDir=/tempFileDir/data/fastq/
tempFileDir=/tempFileDir/data/bam/temp/
dedupedBamDir=/tempFileDir/data/bam/DedupBAMs/
filteredBamDir=/tempFileDir/data/bam/filteredBAMs/
gvcfFileDir=/tempFileDir/data/gvcfs/
extendedWindowSize=1000
mapQ=48

#Sentieon Parameters.  Uncomment and set to use sentieon:
#sentieon_license=**UNASSIGNED**
#sentieonPath=/sentieon/bin/sentieon


########################################
#Optional Parameters With No Default Values:
########################################
HaplotypeGraphBuilderPlugin.taxa=null
HaplotypeGraphBuilderPlugin.methods=null
HaplotypeGraphBuilderPlugin.chromosomes=null
HaplotypeGraphBuilderPlugin.haplotypeIds=null
HaplotypeGraphBuilderPlugin.localGVCFFolder=null
CreateIntervalBedFilesPlugin.extendedBedFile=null
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription=null
RunHapConsensusPipelinePlugin.referenceFasta=null
RunHapConsensusPipelinePlugin.rankingFile=null
RunHapConsensusPipelinePlugin.collapseMethodDetails=null
AssemblyHaplotypesMultiThreadPlugin.gvcfOutputDir=null


#FilterGVCF Parameters.  Adding any of these will add more filters.
#exclusionString=**UNASSIGNED**
#DP_poisson_min=0.0
#DP_poisson_max=1.0
#DP_min=**UNASSIGNED**
#DP_max=**UNASSIGNED**
#GQ_min=**UNASSIGNED**
#GQ_max=**UNASSIGNED**
#QUAL_min=**UNASSIGNED**
#QUAL_max=**UNASSIGNED**
#filterHets=**UNASSIGNED**

Note: Some parameters remain as UNASSIGNED as they were not needed for the steps I ran (at least from what I could find).

Reply by twrl8 • 14 months ago

score 1 · Accepted Answer · 2023-02-14

14 months ago

lcj34 ▴ 420

The AssemblyMAFFromAnchorWavePlugin does not load the haplotypes to the db. It merely runs anchorwave to align the assemblies to the reference, and then creates the gvcf.gz/tbi files that are needed as input to LoadHaplotypesFromGVCFPlugin.

Did you run LoadHaplotypesFromGVCFPlugin? It is with this plugin that you set the haplotype method. You should define the parameter file's UNASSIGNED variables that are related to LoadHaplotypesFromGVCFPlugin and run that plugin to create/load haplotypes from your anchorwave-aligned genomes. This is where you would set the "assembly_by_anchorwave" method (LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=assembly_by_anchorwave).
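A quick way to see which of a plugin's parameters are still unset is to scan the config file for **UNASSIGNED** values. The following is a minimal Python sketch, assuming the plain key=value layout of the config file posted above; the function name is hypothetical and this is not a PHG utility.

```python
def unassigned_params(config_text, plugin_name):
    """Return the parameter names of plugin_name whose value is still **UNASSIGNED**."""
    missing = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key.startswith(plugin_name + ".") and value == "**UNASSIGNED**":
            missing.append(key.split(".", 1)[1])
    return missing


# Usage (the config path is the one from GetDBConnectionPlugin.config above):
#   with open("/PHG/config.txt") as handle:
#       print(unassigned_params(handle.read(), "LoadHaplotypesFromGVCFPlugin"))
```

For the config file posted above, this would flag wgsKeyFile, bedFile, haplotypeMethodName and gvcfDir for LoadHaplotypesFromGVCFPlugin.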

Ah, thank you very much!! That is on me; for some reason I read the documentation as if it would also load the haplotypes into the db.

I have now loaded the gvcfs into the db, but still have questions.
For one, is it possible to impute the paths separately for every set of WGS data I have, or is there some benefit to submitting them all together apart from automation?

Then, when I try to build the pangenome fasta this time, I get the error below and am wondering if I made another mistake earlier and if you could help me spot it. I am not working on different servers; instead, all files and the db are in folders mounted on singularity using the -B option. For that reason I only used the path without the server address whenever I wrote one. However, I also cannot find the point in the pipeline where I would have defined the reference genome gvcf path. I copied the gvcf and the index into the folder named in the error message, but before that it was simply located in /PHG/inputDir/reference/. Can I not use this without the server address? Or have I made another mistake earlier in the pipeline? Thank you very much in advance!!

[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin - genome path variable must be a semi-colon separated string, with the first portion indicating the server address, e.g. server;/path/to/file. Error on genomePath: /PHG/inputDir/loadDB/gvcfs//Ref.gvcf.gz
java.lang.IllegalArgumentException: genome path variable must be a semi-colon separated string, with the first portion indicating the server address, e.g. server;/path/to/file. Error on genomePath: /PHG/inputDir/loadDB/gvcfs//Ref.gvcf.gz
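
The format that this message asks for can be checked on a path before running the pipeline. The following is a minimal Python sketch based only on the wording of the message above (server;/path/to/file); PHG's own validation may differ, and the function name is hypothetical.

```python
def split_genome_path(genome_path):
    """Split 'server;/path/to/file' into (server, path); raise ValueError otherwise."""
    parts = genome_path.split(";")
    # Exactly two non-empty parts are expected: the server address, then the file path.
    if len(parts) != 2 or not all(part.strip() for part in parts):
        raise ValueError(f"expected 'server;/path/to/file', got: {genome_path!r}")
    return parts[0].strip(), parts[1].strip()


# Usage:
#   split_genome_path("server;/path/to/file")   # accepted
#   split_genome_path("/PHG/inputDir/loadDB/gvcfs//Ref.gvcf.gz")   # raises ValueError
```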

Reply by twrl8 • 14 months ago