Question

Empty intervals file while loading BAMs into PHG

18 months ago

nicohlara • 0

I have successfully set up a PHG using version 1.2 of the Docker tool through step 1, and am now attempting to load in BAMs as part of step 2. When I run:

singularity exec -B ${WORKING_DIR}/phg $DOCKER /CreateHaplotypesFromBAM.groovy -config $CONFIG_FILE

I get:

ERROR net.maizegenetics.plugindef.AbstractPlugin - Error Loading in Bed file, file is empty.  Please double check: phg/inputDir/loadDB/bam/temp/intervals.bed

When I look at the intervals.bed file manually, it is indeed empty. However, running -CreateValidIntervalsFilePlugin by itself produces a populated intervals.bed file with no errors. I can't figure out how to pass that intervals file to CreateHaplotypesFromBAM.groovy, or how to get CreateHaplotypesFromBAM.groovy to work on its own.

singularity exec -B ${WORKING_DIR}/phg $DOCKER /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters $CONFIG_FILE \
-CreateValidIntervalsFilePlugin -intervalsFile ${WORKING_DIR}/phg/anchors.bed \
-referenceFasta ${WORKING_DIR}/phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa \
-mergeOverlaps true \
-generatedFile "$INTERVAL.bed" -endPlugin

I have also tried running CreateHaplotypesFromBAM.groovy with all the flags from my -CreateValidIntervalsFilePlugin and get the same result: an unpopulated intervals.bed file.

Below is a more complete error report and my config file:

[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Getting the db connection from the file
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = phg/srww_phg_v2dot1.db host: localHost user: sqlite type: sqlite
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:phg/srww_phg_v2dot1.db
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:  

[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Pulling the reference ranges from the graph stored in the database
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: query statement: select reference_ranges.ref_range_id, chrom, range_start, range_end, methods.name from reference_ranges  INNER JOIN ref_range_ref_range_method on ref_range_ref_range_method.ref_range_id=reference_ranges.ref_range_id  INNER JOIN methods on ref_range_ref_range_method.method_id = methods.method_id  AND methods.method_type = 7 ORDER BY reference_ranges.ref_range_id
methods size: 1
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: number of reference ranges: 27978
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: time: 0.133820701 secs.
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Writing out the BED files using the reference ranges pulled from the graph.

Config:

###config file. 
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps. 
host=localHost
user=sqlite
password=sqlite
DB=phg/srww_phg_v2dot1.db
DBtype=sqlite
outputDir=phg/outputDir

##Step 1B
# Load genome intervals parameters
referenceFasta=/90daydata/genolabswheatphg/SRWW_PHG_3/phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa
anchors=/90daydata/genolabswheatphg/SRWW_PHG_3/phg/anchors.bed
genomeData=phg/inputDir/reference/load_genome_data.txt
localGVCFFolder=phg/outputDir/GVCF_local
###Not included in example config
refServerPath=Atlas-dtn.hpc.msstate.edu;/project/genolabswheatphg/srww_phg/ref

liquibaseOutdir=phg/outputDir

#System parameters.  Xmx is the java heap size and numThreads will be used to set threads available for multithreading components.
Xmx=100G
numThreads=20

##Keyfile location.
keyFile=phg/loadHapsGVCF_keyfile_BAMS.txt
#keyFile=phg/loadHaps_fasta_keyfile.txt


asmMethodName=mummer4
wgsMethodName=GATK_PIPELINE

consensusMethodName=CONSENSUS
inputConsensusMethods=GATK_PIPELINE

fastqFileDir=phg/inputDir/loadDB/fastq/
dedupedBamDir=phg/inputDir/loadDB/bam/dedup/
#dedupedBamDir=phg/inputDir/BAMs/

#gvcfFileDir=phg/inputDir/loadDB/gvcf/
gvcfDir=phg/inputDir/loadDB/gvcf/
#localGVCFFolder=phg/outputDir/GVCF_local
filteredBamDir=phg/inputDir/BAMs_filtered/
wgsKeyFile=phg/loadHapsGVCF_keyfile_BAMS.txt
mapQ=48
refRangeMethods=FocusRegion,FocusComplement
extendedWindowSize=1000
haplotypeMethodName=TEST_PARENT_LOAD
gvcfFileDir =phg/inputDir/loadDB/gvcf/
tempFileDir =phg/inputDir/loadDB/bam/temp/
filteredOutputBAMDir=phg/inputDir/loadDB/bam/mapqFiltered/
dedupedBAMDir=phg/inputDir/loadDB/bam/dedup/
intervalsFile=phg/anchors.bed
generatedFile=phg/inputDir/loadDB/bam/temp/intervals.bed

###Assembly from alignment using anchorwave settings
#AssemblyMAFFromAnchorWavePlugin.outputDir=phg/outputDir
#AssemblyMAFFromAnchorWavePlugin.keyFile=phg/anchorwave_keyfile.txt
#AssemblyMAFFromAnchorWavePlugin.gffFile=phg/anchors.gff3
#AssemblyMAFFromAnchorWavePlugin.refFasta=phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa
#AssemblyMAFFromAnchorWavePlugin.threadsPerRun=4
#AssemblyMAFFromAnchorWavePlugin.numRuns=2


# WGS Haplotype Filtering criteria.  These are the defaults.
GQ_min=50
QUAL_min=200
DP_poisson_min=.01
DP_poisson_max=.99
filterHets=true

##Consensus Plugin Parameters
minFreq=0.5
maxClusters=30
minSite=30
minCoverage=0.1
maxThreads=10
minTaxa=1
mxDiv=0.01

#This sets the type of clustering mode.
#Valid params are: upgma, upgma_assembly, and kmer_assembly
#The two assembly parameters are designed for assembly haplotypes and will choose a representative haplotype as the consensus instead of attempting to merge calls like with upgma.
clusteringMode=kmer_assembly

#If you want to use an assembly clusteringMode, you must have a ranking file.
#The ranking file must be a tab separated list of taxon\trankingScore where higher numbers are a better rank.  This file is used to chose the representative haplotype
rankingFile=phg/ranking_file.txt

##Optional if you want to use kmer_assembly as the clusteringMode. Otherwise is ignored 
kmerSize=7
distanceCalculation=Euclidean

##Graph building parameters
includeVariants=true

PHG • 521 views

updated 18 months ago by lcj34 ▴ 420 • written 18 months ago by nicohlara • 0

You should not need to run CreateValidIntervalsFilePlugin. Instead, you should use the same BED file you used during Step 1 to populate the DB. Looking at your log, my best guess is that it's this file:

/90daydata/genolabswheatphg/SRWW_PHG_3/phg/anchors.bed

Does this file have data in it? Also verify that there were no errors in the log of the Step 1 run.
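
One quick way to answer the "does it have data" question is to count the usable lines in the BED file. The sketch below is only an illustration: the function name count_bed_intervals is made up for this post, and it assumes a plain tab-separated BED file with chrom, start, and end in the first three columns. It does not reproduce whatever validation PHG itself applies.

```python
from pathlib import Path


def count_bed_intervals(bed_path):
    """Return the number of well-formed interval lines in a BED file.

    Hypothetical helper for illustration; not part of PHG.
    Raises ValueError if a line has fewer than three fields,
    non-integer coordinates, or start >= end.
    """
    count = 0
    with Path(bed_path).open() as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            # Skip blank lines and common BED header/comment lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split("\t")
            if len(fields) < 3:
                raise ValueError(
                    f"line {line_number}: expected at least 3 tab-separated fields"
                )
            start, end = int(fields[1]), int(fields[2])
            if start < 0 or start >= end:
                raise ValueError(f"line {line_number}: invalid range {start}-{end}")
            count += 1
    return count
```

Calling count_bed_intervals on the anchors.bed path above and getting 0 would mean the file has no intervals to load; a ValueError would point at the first malformed line.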

18 months ago by zrm22 ▴ 40

score 0 · Answer 1 · 2022-11-11

Nico - Do you have the output from when you ran the initial plugin? The groovy script is creating the intervals.bed file by pulling ranges from the db. It doesn't look like anything is stored to the db and that is why the file is empty.
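
If it helps, the contents of the SQLite db can be inspected directly. This is a rough sketch, not PHG code: it uses Python's standard sqlite3 module and the table and column names that appear in the query statement in the log above, drops the method_type filter from that query, and groups the ranges by method name. The function name count_reference_ranges is invented for this example.

```python
import sqlite3
from pathlib import Path

# Join taken from the query statement in the posted log, without the
# method_type filter, grouped so each method name gets a range count.
RANGES_PER_METHOD_SQL = """
SELECT methods.name, COUNT(*)
FROM reference_ranges
INNER JOIN ref_range_ref_range_method
    ON ref_range_ref_range_method.ref_range_id = reference_ranges.ref_range_id
INNER JOIN methods
    ON ref_range_ref_range_method.method_id = methods.method_id
GROUP BY methods.name
ORDER BY methods.name
"""


def count_reference_ranges(db_path):
    """Return {method name: number of reference ranges} for a PHG SQLite db.

    Hypothetical helper for illustration. The db is opened read-only, so a
    wrong path raises sqlite3.OperationalError instead of creating a file.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        return dict(connection.execute(RANGES_PER_METHOD_SQL).fetchall())
    finally:
        connection.close()
```

Running count_reference_ranges("phg/srww_phg_v2dot1.db") from the working directory would show which method names have ranges attached; those names can then be compared with the refRangeMethods value in the config file.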

The CreateValidIntervalsFilePlugin takes an existing intervals file and creates a file in the format required by the PHG LoadAllIntervalsToPHGdbPlugin. But this file isn't stored anywhere in the db - it is merely available for use when loading the initial db.

I'm concerned the initial run didn't finish correctly. If you can post (or send to us directly) the output from your initial steps, that might help. Also please include your config file. It would be good to check the variables in the config file against what shows up in the output file from when you ran the PHG steps.
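
As a starting point for that comparison, a small script can load the config into a dictionary and list keys that differ only in letter case (the posted config has both dedupedBamDir and dedupedBAMDir, for example). This is a sketch under assumptions: the function names are made up here, the parser only handles simple key=value lines with # comments, and whether PHG treats such keys as distinct is not confirmed in this thread.

```python
from collections import defaultdict


def read_config(path):
    """Parse key=value lines, skipping blank lines and '#' comments.

    Hypothetical helper for illustration; whitespace around keys and
    values is stripped, and a repeated key keeps its last value.
    """
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def case_variant_keys(settings):
    """Return groups of keys that are identical apart from letter case."""
    groups = defaultdict(list)
    for key in settings:
        groups[key.lower()].append(key)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

The dictionary returned by read_config can be checked by hand against the parameter values echoed in the Step 1 output, and case_variant_keys flags names worth a second look.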