Question

No localGVCFFolder parameter in config file - problem passing parameter to pipeline

14 months ago

matt.shenton • 0

Hi there, I'm a bit mystified about how to pass this parameter. It's there in my config file and seems to be read, but I eventually get a warning that there is no localGVCFFolder parameter in the config file:

WARN net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - No localGVCFFolder parameter in config file - will not copy created reference gvcfs to folder for consensus processing.

STEP0

sudo singularity build phg_20230209.simg docker://maizegenetics/phg

WORKING_DIR="/phg/rc_small_db"

singularity exec -B /home/mshenton/analysis/PHG/:/phg/ /home/mshenton/analysis/PHG/phg_20230209.simg /tassel-5-standalone/run_pipeline.pl -debug -Xmx1G -MakeDefaultDirectoryPlugin -workingDir ${WORKING_DIR} -endPlugin

STEP1

WORKING_DIR="/home/mshenton/analysis/PHG/rc_small_db"

SINGULARITY_CONFIG_FILE=/phg/DBconfig.txt

singularity exec -B $WORKING_DIR:/phg/ /home/mshenton/analysis/PHG/phg_20230209.simg /tassel-5-standalone/run_pipeline.pl \
  -Xmx20G -debug -configParameters ${SINGULARITY_CONFIG_FILE} \
  -MakeInitialPHGDBPipelinePlugin -endPlugin

DBconfig.txt:

host option

host=localHost
user=sqlite
password=sqlite
DB=/phg/rc_small_db.db
DBtype=sqlite
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
refServerPath=localhost:/
referenceFasta=/phg/inputDir/reference/IRGSP-1.0_genome_M_C_unanchored.fa
genomeData=/phg/inputDir/reference/load_genome_data.txt
anchors=/phg/inputDir/reference/valid1000RAP-DB_MSU_intervals.bed
localGVCFFolder=/phg/GVCFFolder

[main] INFO net.maizegenetics.plugindef.ParameterCache - load: loading parameter cache with: /phg/DBconfig.txt
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: anchors value: /phg/inputDir/reference/valid1000RAP-DB_MSU_intervals.bed
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: configFile value: /phg/DBconfig.txt
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: outputDir value: /phg/outputDir
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: referenceFasta value: /phg/inputDir/reference/IRGSP-1.0_genome_M_C_unanchored.fa
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: user value: sqlite
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: DB value: /phg/rc_small_db.db
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: DBtype value: sqlite
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: localGVCFFolder value: /phg/GVCFFolder
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: liquibaseOutdir value: /phg/outputDir
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: password value: sqlite
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: host value: localHost
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: refServerPath value: localhost:/
[main] INFO net.maizegenetics.plugindef.ParameterCache - ParameterCache: key: genomeData value: /phg/inputDir/reference/load_genome_data.txt

[.......]

[pool-1-thread-1] INFO net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin - Please wait, begin Command:liquibase --driver=org.sqlite.JDBC --url=jdbc:sqlite:/phg/rc_small_db.db --username=sqlite --password=sqlite --changeLogFile=changelogs/db.changelog-master.xml --loglevel=FINE changeLogSync
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin: time: Feb 9, 2023 5:10:12
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - Done setting up Liquibase.
[pool-1-thread-1] WARN net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - No localGVCFFolder parameter in config file - will not copy created reference gvcfs to folder for consensus processing.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - MakeInitialPHGDBPipelinePlugin complete!
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin: time: Feb 9, 2023 5:10:12
[pool-1-thread-1] INFO net.maizegenetics.pipeline.TasselPipeline - net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin: time: Feb 9, 2023 5:10:12: progress: 100%

localGVCFFolder config PHG MakeInitialPHGDBPipelinePlugin • 1.2k views

ADD COMMENT • link 14 months ago by matt.shenton • 0

"Hi there, a bit mystified with how to pass this parameter. It's there in my config file, and seems to be read, but then eventually I get the warning that localGVCFFolder doesn't have a parameter in the config file"

What is the context of that question? "Pass this parameter" to what?

ADD REPLY • link 14 months ago by Pierre Lindenbaum 161k

Dear Pierre,

Thanks for responding. According to the warning, I thought that the MakeInitialPHGDBPipelinePlugin required the "localGVCFFolder parameter" and I was wondering how to set that correctly. Sorry for my poor explanation.

I will make further checks and respond to the thread again next week.

Best regards

Matt

ADD REPLY • link 14 months ago by matt.shenton ▴ 40

score 1 · Answer 1 · 2023-02-09

14 months ago

lcj34 ▴ 420

Hi Matt - Can you tell me which version of the PHG you are running? In the latest versions, that warning no longer exists. We decided against moving files programmatically: the step transferred stored GVCF files from a host system to a local system, and file transfers were an issue for some servers.

Currently, when the reference intervals are processed, the LoadAllIntervalsToPHGdbPlugin creates a gvcf file from the reference haplotypes. The ref gvcf and indexed gvcf are stored in the same folder as the ref fasta. You should copy those gvcf files to your defined localGVCFFolder so that they are picked up when creating graphs that need to include variant data.
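The copy step described above could be sketched as a small shell helper. This is only an illustration: `copy_ref_gvcfs` is a hypothetical name, and it assumes the reference gvcf and its index sit directly in the reference fasta folder and have "vcf" somewhere in their file names. Check the actual names PHG wrote before relying on the pattern.

```shell
# Sketch: copy the reference gvcf (and its index) into the local gvcf folder.
# Assumption: both files live beside the reference fasta and contain "vcf"
# in their file names; adjust the -name pattern to match your files.
copy_ref_gvcfs() {
    ref_dir=$1     # host folder that holds the reference fasta
    dest_dir=$2    # host folder that the config exposes as localGVCFFolder
    mkdir -p "$dest_dir"
    # -maxdepth 1 keeps the search to the reference folder itself
    find "$ref_dir" -maxdepth 1 -type f -name '*vcf*' -exec cp -p {} "$dest_dir"/ \;
}
```

With the bind mount used in this thread, a call might look like `copy_ref_gvcfs "$WORKING_DIR/inputDir/reference" "$WORKING_DIR/GVCFFolder"`, run on the host rather than inside the container.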

I'm sorry this isn't clear. I will check the documentation and update as needed.

Lynn

ADD COMMENT • link 14 months ago by lcj34 ▴ 420

Dear Lynn,

Many thanks for your reply.

I used

singularity build phg_20230208.simg docker://maizegenetics/phg

on the 8th Feb 2023

I was under the impression that this would get me the latest version. My apologies, I can't make further checks until next week. I will update the thread again then.

Best regards

Matt

ADD REPLY • link 14 months ago by matt.shenton ▴ 40

Hi Matt -

Thanks for the info. When pulling PHG from Docker, we recommend requesting a specific tag. That way you always know which version was run, because "latest" is updated each time a new image is posted to the hub.

Having said that, the next version to be posted will include the PHG version in the logs when PHG is run.
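As a rough sketch of pinning a tag, the snippet below derives both the image file name and the docker URI from one variable, so the version is visible in the file name. `phg_build_command` is a hypothetical helper, the tag value 1.3 is only an example, and the function prints the command instead of running it.

```shell
# Sketch: build the singularity command from a single pinned tag.
# PHG_TAG is an example value; set it to the tag you actually want to run.
PHG_TAG="${PHG_TAG:-1.3}"

# Print (not run) the build command so it can be checked first.
phg_build_command() {
    echo "singularity build phg_${1}.simg docker://maizegenetics/phg:${1}"
}

phg_build_command "$PHG_TAG"
```

Once the printed line looks right, it can be run directly (with sudo if your setup needs it, as in the original post).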

Lynn

ADD REPLY • link 14 months ago by lcj34 ▴ 420

Dear Lynn,

Thanks again for looking at this.

I should explain what I want to do. I am starting with a single rice reference genome, and adding haplotypes from gvcf files I generated by short read mapping.

So I am going through the pipelines again. This time I specified a tag:

sudo singularity build phg1.3.simg docker://maizegenetics/phg:1.3

I think 1.3 is the latest version?

I still get the warning "[pool-1-thread-1] WARN net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - No localGVCFFolder parameter in config file - will not copy created reference gvcfs to folder for consensus processing."

although I have created a folder called "GVCFFolder" and included its path in the config file "localGVCFFolder=/phg/GVCFFolder"

However, I have copied the ref gvcf and indexed gvcf to this folder, as you mentioned, and proceeded to the adding haplotypes step using CreateHaplotypesFromGVCF.groovy

This script seems to proceed OK, and I appear to have some haplotypes in my DB (sqlite database)

A couple of questions here:

1) I have the gvcf files at inputDir/loadDB/gvcf and in my "local gvcf folder". Can I set localGVCFFolder as inputDir/loadDB/gvcf? As it stands, I will eventually have the files in three separate locations, including the gvcfServerPath, or have I misunderstood?

2) In the "load_genome_data.txt" file there is a "Method" column. Does this method affect the create consensus steps later?

3) I am trying to make consensus haplotypes using CreateConsensi.sh, but clearly I haven't got something right. I have "0 taxa used to build distance matrix in createDistanceMatrix". I guess I have failed to specify something (a method?)

sudo singularity exec -B ${WORKDIR}:/phg/ /home/mshenton/analysis/PHGsingularity/phg1.3.simg /CreateConsensi.sh ${SINGULARITY_CONFIG_FILE} IRGSP-1.0_genome_M_C_unanchored.fa GATK_PIPELINE CONSENSUS001_20

my config:

host=localHost
user=sqlite
password=sqlite
DB=/phg/rc_small_db.db
DBtype=sqlite

numThreads=2
Xmx=24G
liquibaseOutdir=/phg/outputDir
referenceFasta=/phg/inputDir/reference/IRGSP-1.0_genome_M_C_unanchored.fa
anchors=/phg/inputDir/reference/valid1000RAP-DB_MSU_intervals.bed
haplotypeMethod=genic
consensusMethod=CONSENSUS001_20

outputDir=/phg/outputDir/align/
gvcfOutputDir=/phg/outputDir/align/gvcfs/

refRangeMethods=genic,intergenic
extendedWindowSize=1000

includeVariants=true
minSite=3
minCoverage=0.1
maxThreads=2
minTaxa=1
mxDiv=0.001

localGVCFFolder=/phg/GVCFFolder
rankingFile=/phg/rankingFile.txt

[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all geneotypes in genotype table=21
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - refRangeRefRangeIDMap is null, creating new one with size : 1000
[ForkJoinPool-1-worker-0] INFO net.maizegenetics.pangenome.hapcollapse.RunHapConsensusPipelinePlugin - Running Cluster Assemblies.
[ForkJoinPool-1-worker-0] INFO net.maizegenetics.pangenome.hapcollapse.RunHapConsensusPipelinePlugin - Loading variants into RangeMap
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - loadAnchorHash: at end, size of refRangeRefRangeIDMap: 1000, number of rs.next processed: 1000
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all methods in method table=7
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all groups in taxa_groups table=0
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all groups in gamete_groups table=21
[ForkJoinPool-1-worker-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - before loading hash, size of all gametes in gametes table=21
[ForkJoinPool-1-worker-0] DEBUG net.maizegenetics.pangenome.hapcollapse.ConsensusProcessingUtils - 0 taxa used to build distance matrix in createDistanceMatrix (line 549)
[ForkJoinPool-1-worker-0] DEBUG net.maizegenetics.pangenome.hapcollapse.RunHapConsensusPipelinePlugin - Finished builder distance matrix for ref range 1
[ForkJoinPool-1-worker-0] ERROR net.maizegenetics.pangenome.hapcollapse.RunHapConsensusPipelinePlugin - Error processing ReferenceRange:01:98559-99170 ErrorMessage:0
[ForkJoinPool-1-worker-0] DEBUG net.maizegenetics.pangenome.hapcollapse.RunHapConsensusPipelinePlugin - 0 java.lang.ArrayIndexOutOfBoundsException: 0

ADD REPLY • link 14 months ago by matt.shenton • 0

To answer your questions: (1) Yes, you can set localGVCFFolder to inputDir/loadDB/gvcf. Any location is fine as long as the software can see the files. People frequently come back to a db after it has sat for a while and only want to run consensus, impute, or create a VCF from paths. In those cases, the software needs to know where a local copy of the gvcfs lives. The assumption is that you have stored the gvcfs on a server somewhere that multiple people can access to bring them to their local machines. The software is merely asking "where can I find a copy of these files on your local machine?" If they are still in inputDir/loadDB/gvcf, then that is fine to give as the localGVCFFolder.

(2) The Method column in load_genome_data.txt: this column is used to associate the haplotypes for the listed genome with a method. It doesn't affect the consensus haplotypes. When you run consensus, you specify a consensus method at that time. We often use long names for ours, which indicate a name for the haplotypes that were used to create the consensus, and often the parameters used when running the consensus pipeline, e.g. CONSENSUS_84plusRef_mxDiv_10toNeg4_maxClusers30

(3) The message "0 taxa used to build distance matrix in createDistanceMatrix" implies there were no variants for the taxa it tried to load. This could be an issue with not finding your gvcf files, but it could also mean there were no nodes in your graph. It might be helpful to see the full log file. If it is too big to post, you can send it to me privately at lcj34@cornell.edu
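One quick way to rule out the "not finding your gvcf files" case is to check, on the host, what the folder named in the config actually contains. The sketch below is not part of PHG: the three function names are hypothetical, it assumes a plain key=value config with one entry per line, and it assumes the working directory is bind-mounted at /phg as in this thread.

```shell
# Print the value of KEY from a key=value config file (last match wins).
config_value() {
    sed -n "s/^$1=//p" "$2" | tail -n 1
}

# Translate a container path under /phg to the matching host path.
# Assumption: the host working directory is bind-mounted at /phg.
to_host_path() {
    case $1 in
        /phg/*) printf '%s\n' "$2${1#/phg}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Count files with "vcf" in their name (gvcfs and their indexes)
# in the folder that the config names as localGVCFFolder.
count_local_gvcfs() {
    folder=$(to_host_path "$(config_value localGVCFFolder "$1")" "$2")
    find "$folder" -maxdepth 1 -type f -name '*vcf*' 2>/dev/null | wc -l | tr -d ' '
}
```

For example, `count_local_gvcfs "$WORKING_DIR/DBconfig.txt" "$WORKING_DIR"` printing 0 would suggest the folder is empty or the path in the config does not match the mount.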

ADD REPLY • link 14 months ago by lcj34 ▴ 420

"0 taxa used to build distance matrix in createDistanceMatrix" seems to have been caused because the chromosome names in the gvcf reference were different from those in the gvcf files.

The "chr" prefix was removed by -CreateValidIntervalsFilePlugin, but remained in my original gvcf files.

I now seem to have things working as far as creating consensus.
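For anyone hitting the same mismatch, a check along these lines can show it before running consensus. It is a sketch under assumptions: the helper names are hypothetical, the inputs are an uncompressed BED file and an uncompressed gvcf, and the only difference between the two naming schemes is a leading "chr".

```shell
# List the distinct chromosome names in column 1 of a BED or VCF file,
# skipping header lines that start with "#".
contigs_in() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print BED chromosome names that never appear in the gvcf.
contigs_missing_from_vcf() {
    vcf_names=$(mktemp)
    contigs_in "$2" > "$vcf_names"
    # -F fixed strings, -x whole-line match, -v keep the non-matching names
    contigs_in "$1" | grep -Fxv -f "$vcf_names" || true
    rm -f "$vcf_names"
}

# Write a copy of the gvcf to stdout with the leading "chr" removed from
# data lines and from ##contig header lines.
strip_chr_prefix() {
    sed -e 's/^chr//' -e 's/^\(##contig=<ID=\)chr/\1/' "$1"
}
```

An empty result from `contigs_missing_from_vcf intervals.bed sample.g.vcf` means every interval chromosome is present in the gvcf. Compressed gvcfs would need to be decompressed first, then compressed and indexed again after renaming.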

Many thanks for your help

Matt

ADD REPLY • link 14 months ago by matt.shenton • 0

PS Does the CreateSmallGenomesPlugin still work for the latest version of PHG?

ADD REPLY • link 14 months ago by matt.shenton • 0

I think CreateSmallGenomesPlugin still works, but try it and let me know if you have problems.

ADD REPLY • link 14 months ago by lcj34 ▴ 420