Hello, I've found some strange behavior involving vg's GBWT index and wanted to report it and, if possible, figure out why it occurs.
When I build a genome graph with minigraph-cactus, then use vg construct and vg index to create a vg graph and GBWT index from the resulting VCF, the sample and haplotype data are missing from the GBWT.
Graph creation:
cactus-minigraph ./js graph.txt graph-base.sv.gfa --reference ref1 ref2 ref3
cactus-graphmap ./js graph.txt graph-base.sv.gfa graph-base.paf --outputFasta graph-base.sv.gfa.fa --reference ref1 ref2 ref3
cactus-graphmap-split ./js graph.txt graph-base.sv.gfa graph-base.paf --outDir ./chroms --reference ref1 ref2 ref3
cactus-align ./js ./chroms/chromfile.txt graph-base-chrom-alignments --batch --pangenome --reference ref1 ref2 ref3 --outVG
cactus-graphmap-join ./js --workDir . --vg ./*.vg --hal ./*.hal --outDir . --outName graph --reference ref1 ref2 ref3 --vcfReference ref1 ref2 ref3 --gfa --gbz --vcf --clip 10000 --filter 0
vg construct -p -t 96 -N -r ref1.fasta.gz -v graph.vcf > graph.vg
vg index -p -T -t 96 -b . -x graph.xg -G graph.gbwt graph.vg
Missing haplotypes/samples:
vg gbwt -S -H -L -M graph.gbwt
11 paths with names, 1 samples with names, 1 haplotypes, 11 contigs with names
1
_gbwt_ref
In the above case, the gbwt should include 32 haplotypes/samples, but '_gbwt_ref' is returned as the only sample.
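For anyone who wants to check their own index for the same symptom, here is a minimal sketch that pulls the sample count out of the summary line printed by `vg gbwt -M` and compares it to the expected number. The helper name `gbwt_sample_count` is made up for illustration (it is not part of vg), and the parsing assumes the summary line has the same wording as the output pasted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of vg): read `vg gbwt -M` output on stdin
# and print the number that precedes "samples with names".
gbwt_sample_count() {
    sed -n 's/^.*, \([0-9][0-9]*\) samples with names.*$/\1/p'
}

# Summary line copied from the output above. In real use, pipe in the
# output of `vg gbwt -M graph.gbwt` instead of this hard-coded string.
summary='11 paths with names, 1 samples with names, 1 haplotypes, 11 contigs with names'
count=$(printf '%s\n' "$summary" | gbwt_sample_count)

# Number of haplotypes/samples this graph is expected to contain.
expected=32

if [ "$count" -ne "$expected" ]; then
    echo "GBWT has $count sample(s), expected $expected"
fi
```

Only the summary line is matched, so the extra lines printed by -S, -H and -L are ignored if they are piped in as well.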
However, when the indexes are instead created during the final step of the minigraph-cactus pipeline (by passing --giraffe clip to cactus-graphmap-join) and the GBWT is extracted from the resulting GBZ, this behavior doesn't occur:
cactus-minigraph ./js graph.txt graph-base.sv.gfa --reference ref1 ref2 ref3
cactus-graphmap ./js graph.txt graph-base.sv.gfa graph-base.paf --outputFasta graph-base.sv.gfa.fa --reference ref1 ref2 ref3
cactus-graphmap-split ./js graph.txt graph-base.sv.gfa graph-base.paf --outDir ./chroms --reference ref1 ref2 ref3
cactus-align ./js ./chroms/chromfile.txt graph-base-chrom-alignments --batch --pangenome --reference ref1 ref2 ref3 --outVG
cactus-graphmap-join ./js --workDir . --vg ./*.vg --hal ./*.hal --outDir . --outName graph --reference ref1 ref2 ref3 --vcfReference ref1 ref2 ref3 --gfa --gbz --vcf --clip 10000 --filter 0 --giraffe clip
vg gbwt -Z graph.gbz -o graph.gbwt
See:
vg gbwt -S -H -L -M graph.gbwt
sagemaker-user@default:~/data$ vg gbwt -S -H -L -M hapgraphv1.0-clip10000.gbwt
52354 paths with names, 32 samples with names, 32 haplotypes, 311 contigs with names
32
line1
line2
...
line32
I'm not sure if this is intended behavior, but I wanted to report it in case others encounter it.