vg dist on (large) graph runtime?
2 days ago
kingcohn ▴ 30

Hello, I'm looking to map short, paired-end Illumina reads to my PGGB graph (chromosome level) using vg. I've generated a GBZ, combined from GBWT and GBWTGraph files, but the program appears to hang at [IndexRegistry]: Constructing distance index for Giraffe.

Here are the steps I used after generating the gfa:

$ ./vg gbwt -G ../LowerPI_Cglan_pggb/Curculio_Chrom1_revisedPanSN.fasta.2afcd0e.11fba48.33b105f.smooth.final.gfa --num-threads 64 -p -d $(pwd) -L -o Cg1CcCn2.gbwt -g Cg1CcCn2.gg
Building input GBWTs
Input type: GFA
Opening GFA file ../LowerPI_Cglan_pggb/Curculio_Chrom1_revisedPanSN.fasta.2afcd0e.11fba48.33b105f.smooth.final.gfa
Validating GFA file ../LowerPI_Cglan_pggb/Curculio_Chrom1_revisedPanSN.fasta.2afcd0e.11fba48.33b105f.smooth.final.gfa
Found 5369317 segments, 7286546 links, 3 paths, and 0 walks in 570.024 seconds
Storing generic named paths as sample _gbwt_ref
GBWT insertion batch size: 101953820 nodes
Parsing segments
Breaking segments into 1024 bp nodes
Parsed 5881329 nodes in 6.71451 seconds
Parsing links
Parsed 7798558 edges in 4.67029 seconds
Creating jobs
Created 1 jobs for 1 components in 2.3017 seconds
Parsing metadata
Metadata: 3 paths with names, 3 samples with names, 3 haplotypes, 2 contigs with names
Parsed metadata in 0.000147939 seconds
Indexing paths/walks
Starting job 0 (5881329 nodes, 3 paths, 0 walks)
Finished job 0 in 17.5745 seconds
Merging partial indexes
Indexed 3 paths and 0 walks in 20.5536 seconds
Parsing GFA header tags
Parsed header tags in 1.01738e-05 seconds
GBWTs built in 604.867 seconds, 4.86007 GiB

Saving compressed GBWT to Cg1CcCn2.gbwt
GBWT serialized in 19.4852 seconds, 4.86007 GiB

Building GBWTGraph
Saving GBWTGraph to Cg1CcCn2.gg
GBWTGraph built in 295.132 seconds, 4.86007 GiB

And the current command:

$ ./vg giraffe -p -g Cg1CcCn2.gg -f trimmed/S471.R1.trim.fq.gz -f trimmed/S471.R2.trim.fq.gz --read-group "@RG\tID:S471\tSM:S471\tPL:ILLUMINA"  -H Cg1CcCn2.gbwt >> Cg_C1/S471_index.gam
Preparing Indexes
[IndexRegistry]: Combining Giraffe GBWT and GBWTGraph into GBZ.
[IndexRegistry]: Constructing distance index for Giraffe.

Any insight into runtime, resources, or ways of expediting this step would be great! Thank you.

vg • 344 views
17 hours ago
Jouni Sirén ▴ 770

It looks like your issue is slow disk I/O. The initial pass over the GFA file takes 570 seconds, while the computational part of GBWT construction takes about 35 seconds. If you have the files on a congested network drive, that could easily make distance index construction slow, especially because some distributed file systems do not handle memory-mapped files well. You may be able to solve the problem by keeping the files on a local disk.
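
One quick way to check whether a directory sits on network storage is to ask for its filesystem type. This is a minimal sketch, assuming GNU coreutils stat is available. The helper names fs_type and is_network_fs, and the list of type names treated as network filesystems, are illustrative assumptions rather than anything exhaustive.

```shell
#!/bin/bash
# Print the type of the filesystem that holds the given path (GNU stat).
fs_type() {
    stat -f -c %T "$1"
}

# Return 0 if the type name looks like a network or distributed filesystem.
# The list below is an assumption; extend it for your site.
is_network_fs() {
    case "$1" in
        nfs*|lustre|gpfs|cifs|smb*|ceph|beegfs|afs) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on a directory (defaults to the current directory).
dir="${1:-.}"
fstype="$(fs_type "$dir")"
if is_network_fs "$fstype"; then
    echo "$dir is on $fstype: likely network storage"
else
    echo "$dir is on $fstype: likely local storage"
fi
```

Running it against the directory that holds the GFA and against the node's scratch directory shows whether the two differ.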

Additionally, it is better to build the indexes first using vg autoindex with the GFA file, and only run vg giraffe after all indexes have been built.
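
As a sketch of that two-step order: build all indexes once, then map against them. The flags and output file suffixes are the ones that appear elsewhere in this thread. The input names graph.gfa, reads_R1.fq.gz and reads_R2.fq.gz, the prefix myindex, and the run dry-run wrapper are placeholders introduced for illustration. With DRY_RUN left at its default of 1, the commands are only printed.

```shell
#!/bin/bash
# Dry-run wrapper: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

VG="${VG:-vg}"      # path to the vg binary
PREFIX="myindex"    # placeholder prefix for all index files

# Step 1: build the Giraffe indexes from the GFA.
build_indexes() {
    run "$VG" autoindex -w sr-giraffe -g "$1" --prefix "$PREFIX"
}

# Step 2: map paired reads with the prebuilt indexes.
# When running for real, redirect stdout to a .gam file.
map_reads() {
    run "$VG" giraffe -p \
        -Z "$PREFIX.giraffe.gbz" \
        -m "$PREFIX.shortread.withzip.min" \
        -d "$PREFIX.dist" \
        -z "$PREFIX.shortread.zipcodes" \
        -f "$1" -f "$2"
}

build_indexes graph.gfa
map_reads reads_R1.fq.gz reads_R2.fq.gz
```

Keeping the two steps separate means a mapping failure does not force the indexes to be rebuilt.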


Thank you for the reply! So, just to confirm: this isn't a graph issue, given the PGGB construction and node density? Could you suggest ways to speed this up on a Slurm HPC cluster? Much appreciated!

#!/bin/bash
#SBATCH -J index_giraffe_gbwt
#SBATCH -N 1
#SBATCH -n 72
#SBATCH --mem=300G
#SBATCH -p longq7-eng
#SBATCH -t 72:00:00
#SBATCH -o logs/%x-%j.out
set -euo pipefail


#new approach re:Jouni

./vg autoindex -p -g ../LowerPI_Cglan_pggb/Curculio_Chrom1_revisedPanSN.fasta.2afcd0e.11fba48.33b105f.smooth.final.gfa -r /mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/MinigraphCactus/split_fa/Cglandium.1.fasta -w sr-giraffe --prefix Cg1CcCn2.giraffe_119

#./vg giraffe -p -g Cg1CcCn2.gg -f trimmed/S471.R1.trim.fq.gz -f trimmed/S471.R2.trim.fq.gz --read-group "@RG\tID:S471\tSM:S471\tPL:ILLUMINA"  -H Cg1CcCn2.gbwt >> Cg_C1/S471_index.gam

/mnt/archive/groups/degiorgio_group ...

Judging by the name, this is likely remote/mounted storage, which may be the speed bottleneck that @Jouni was referring to. If possible, copy the data to local disk and write output there while computing, which should speed things up. After the job completes, you can copy or move the output to the desired storage (like the path above).
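
A minimal sketch of that stage-in, compute, stage-out pattern for a batch job. The scratch location is an assumption: many clusters expose node-local disk through TMPDIR or /tmp, but check your site's documentation. stage_in and stage_out are hypothetical helper names, and the paths in the usage comment are placeholders.

```shell
#!/bin/bash
# Copy input files into a fresh directory on node-local scratch
# and print the path of that directory.
stage_in() {
    local scratch
    scratch="$(mktemp -d "${TMPDIR:-/tmp}/vgjob.XXXXXX")" || return 1
    cp -- "$@" "$scratch"/ || return 1
    echo "$scratch"
}

# Copy every file whose name starts with the prefix from scratch
# back to permanent storage.
stage_out() {
    local scratch="$1" prefix="$2" dest="$3"
    mkdir -p "$dest" || return 1
    cp -- "$scratch"/"$prefix"* "$dest"/
}

# Usage inside a job script (paths are placeholders):
#   work="$(stage_in /remote/graph.gfa /remote/reads_R1.fq.gz /remote/reads_R2.fq.gz)"
#   cd "$work"
#   vg autoindex -w sr-giraffe -g graph.gfa --prefix myindex
#   stage_out "$work" myindex /remote/results
#   rm -rf "$work"
```

The remote filesystem is then touched only twice, once for a sequential copy in and once for a copy out, instead of serving random reads during index construction.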

3 hours ago
kingcohn ▴ 30

Thanks all! It looks like it finished and generated the dist, zipcodes and withzip.min files! But now giraffe is failing with an error similar to the one on the issues page: https://github.com/vgteam/vg/issues/4556

Here's my command and the error log. Also, the zipcodes file is <10 bytes?

9 Nov  9 14:38 Cg1CcCn2.giraffe_119.shortread.zipcodes
1.3G Nov  9 01:56 Cg1CcCn2.giraffe_119.shortread.withzip.min
69M Nov  9 01:46 Cg1CcCn2.giraffe_119.dist
126M Nov  9 01:41 Cg1CcCn2.giraffe_119.giraffe.gbz
$ ./vg giraffe -p -Z Cg1CcCn2.giraffe.gbz -m Cg1CcCn2.giraffe_119.shortread.withzip.min -d Cg1CcCn2.giraffe_119.dist -z Cg1CcCn2.giraffe_119.shortread.zipcodes -f trimmed/S471.R1.trim.fq.gz -f trimmed/S471.R2.trim.fq.gz > Cg_C1/S471_index.gam

Guessing that Cg1CcCn2.gbwt is Giraffe GBWT
Guessing that Cg1CcCn2.gg is GBWTGraph
Preparing Indexes
Loading Minimizer Index
Loading Zipcodes
Loading GBZ
Loading Distance Index v2
Paging in Distance Index v2
Initializing MinimizerMapper
Loading and initialization: 551.365 seconds
Of which Distance Index v2 paging: 69.248 seconds
Mapping reads to "-" (GAM)
--watchdog-timeout 10
--batch-size 512
--match 1
--mismatch 4
--gap-open 6
--gap-extend 1
--full-l-bonus 5
--max-multimaps 1
--hit-cap 10
--hard-hit-cap 500
--score-fraction 0.9
--max-min 500
--min-coverage-flank 250
--num-bp-per-min 1000
--downsample-window-length 18446744073709551615
--downsample-window-count 0
--distance-limit 200
--max-extensions 800
--max-alignments 8
--cluster-score 50
--pad-cluster-score 20
--cluster-coverage 0.3
--max-extension-mismatches 4
--extension-score 1
--extension-set 20
--rescue-attempts 15
--max-fragment-length 2000
--paired-distance-limit 2
--rescue-subgraph-size 4
--rescue-seed-limit 100
--mapq-score-window 0
--mapq-score-scale 1
--zipcode-tree-score-threshold 50
--pad-zipcode-tree-score-threshold 20
--zipcode-tree-coverage-threshold 0.3
--zipcode-tree-scale 2
--min-to-fragment 4
--max-to-fragment 10
--max-direct-chain 0
--gapless-extension-limit 0
--fragment-max-graph-lookback-bases 300
--fragment-max-graph-lookback-bases-per-base 0.03
--fragment-max-read-lookback-bases 18446744073709551615
--fragment-max-read-lookback-bases-per-base 1
--max-fragments 18446744073709551615
--fragment-max-indel-bases 2000
--fragment-max-indel-bases-per-base 0.2
--fragment-gap-scale 1
--fragment-points-per-possible-match 0
--fragment-score-fraction 0.1
--fragment-max-min-score 1.79769e+308
--fragment-min-score 60
--fragment-set-score-threshold 0
--min-chaining-problems 1
--max-chaining-problems 2147483647
--max-graph-lookback-bases 3000
--max-graph-lookback-bases-per-base 0.3
--max-read-lookback-bases 18446744073709551615
--max-read-lookback-bases-per-base 1
--max-indel-bases 2000
--max-indel-bases-per-base 0.2
--item-bonus 0
--item-scale 1
--gap-scale 1
--rec-penalty-chain 0
--rec-penalty-fragment 0
--points-per-possible-match 0
--chain-score-threshold 100
--min-chains 4
--min-chain-score-per-base 0.01
--max-min-chain-score 200
--max-skipped-bases 0
--max-chains-per-tree 1
--max-chain-connection 100
--max-tail-length 100
--max-dp-cells 18446744073709551615
--max-tail-gap 18446744073709551615
--max-middle-gap 18446744073709551615
--max-tail-dp-length 30000
--max-middle-dp-length 2147483647
--wfa-max-mismatches 2
--wfa-max-mismatches-per-base 0.1
--wfa-max-max-mismatches 20
--wfa-distance 10
--wfa-distance-per-base 0.1
--wfa-max-distance 200
--softclip-penalty 0
--min-unique-node-fraction 0
--rescue-algorithm dozeu
vg: src/aligner.cpp:934: size_t vg::GSSWAligner::longest_detectable_gap(size_t, size_t) const: Assertion `read_length >= read_pos' failed.

Crash report for vg v1.69.0 "Bologna"
Caught signal 6 raised at address 0x238188c; tracing with backward-cpp
Stack trace (most recent call last) in thread 839073:

20 Object "", at 0xffffffffffffffff, in

warning[vg::Watchdog]: Thread 39 has been checked in for 10 seconds processing: A00755:303:HVG75DSX3:2:1101:3179:33802, A00755:303:HVG75DSX3:2:1101:3179:33802

19 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x24269f3, in __clone

18 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x237ff2a, in start_thread

17 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x23227bd, in gomp_thread_start

16 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1086c95, in unsigned long vg::io::paired_for_each_parallel_after_wait<vg::Alignment>(std::function<bool (vg::Alignment&, vg::Alignment&)>, std::function<void (vg::Alignment&, vg::Alignment&)>, std::function<bool ()>, unsigned long) [clone ._omp_fn.0]

15 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0xea7573, in std::_Function_handler<void (vg::Alignment&, vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&, vg::Alignment&)#9}>::_M_invoke(std::_Any_data const&, vg::Alignment&, vg::Alignment&)

14 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1391b3c, in vg::MinimizerMapper::map_paired(vg::Alignment&, vg::Alignment&, std::vector<std::pair<vg::Alignment, vg::Alignment>, std::allocator<std::pair<vg::Alignment, vg::Alignment> > >&)

13 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1386abc, in vg::MinimizerMapper::map_from_extensions(vg::Alignment&)

12 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x13830cd, in void vg::MinimizerMapper::process_until_threshold_b<int>(std::vector<int, std::allocator<int> > const&, double, unsigned long, unsigned long, vg::LazyRNG&, std::function<bool (unsigned long, unsigned long)> const&, std::function<void (unsigned long)> const&, std::function<void (unsigned long)> const&) const [clone .isra.0]

11 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1397d57, in vg::MinimizerMapper::map_from_extensions(vg::Alignment&)::{lambda(unsigned long, unsigned long)#8}::operator()(unsigned long, unsigned long) const [clone .constprop.0]

10 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1396065, in vg::MinimizerMapper::find_optimal_tail_alignments(vg::Alignment const&, std::vector<vg::GaplessExtension, std::allocator<vg::GaplessExtension> > const&, vg::LazyRNG&, vg::Alignment&, vg::Alignment&) const

9 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x13836aa, in void vg::MinimizerMapper::process_until_threshold_c<double>(unsigned long, std::function<double (unsigned long)> const&, std::function<bool (unsigned long, unsigned long)> const&, double, unsigned long, unsigned long, vg::LazyRNG&, std::function<bool (unsigned long, unsigned long)> const&, std::function<void (unsigned long)> const&, std::function<void (unsigned long)> const&) const [clone .isra.0]

8 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x139bf1e, in vg::MinimizerMapper::find_optimal_tail_alignments(vg::Alignment const&, std::vector<vg::GaplessExtension, std::allocator<vg::GaplessExtension> > const&, vg::LazyRNG&, vg::Alignment&, vg::Alignment&) const::{lambda(unsigned long, unsigned long)#2}::operator()(unsigned long, unsigned long) const [clone .constprop.0]

7 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x139b42c, in vg::MinimizerMapper::get_tail_forest(vg::GaplessExtension const&, unsigned long, bool, unsigned long*) const

6 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1068450, in vg::GSSWAligner::longest_detectable_gap(unsigned long, unsigned long) const

5 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x1060a74, in vg::GSSWAligner::longest_detectable_gap(unsigned long, unsigned long) const [clone .part.0]

4 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x234e725, in __assert_fail

3 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x637e3b, in __assert_fail_base.cold

2 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x637f13, in abort

1 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x2354d35, in raise

0 Object "/mnt/archive/groups/degiorgio_group/zpc/Curculio_catChromos/variantgraph/vg", at 0x238188c, in __pthread_kill

Library locations:
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.

Context dump:
Thread 0: Starting 'giraffe' subcommand
Thread 39: A00755:303:HVG75DSX3:2:1101:3179:33802, A00755:303:HVG75DSX3:2:1101:3179:33802
Found 2 threads with context.

Please include this entire error log in your bug report!
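
When filing a report for a crash like this, it can help to confirm which index files were actually on disk and to pull out the read pair named in the context dump, so the failure can be reproduced on a tiny input. This is a minimal sketch, assuming standard four-line gzipped FASTQ records. report_indexes and extract_read are hypothetical helper names, not vg features, and whether a 9-byte zipcodes file is expected is not established in this thread.

```shell
#!/bin/bash
# Print the size in bytes of each index file, or MISSING if it does not exist.
report_indexes() {
    local f
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
        else
            printf 'MISSING\t%s\n' "$f"
        fi
    done
}

# Print the four-line FASTQ record whose name matches read_name exactly.
extract_read() {
    local read_name="$1" fastq_gz="$2"
    gzip -cd "$fastq_gz" | awk -v id="@${read_name}" '
        NR % 4 == 1 {
            if (found) exit            # stop after the first matching record
            split($0, f, "[ /]")       # the name ends at the first space or slash
            keep = (f[1] == id)
            if (keep) found = 1
        }
        keep { print }
    '
}

# Usage (file names taken from the listing and command above):
#   report_indexes Cg1CcCn2.giraffe_119.giraffe.gbz Cg1CcCn2.giraffe_119.dist
#   extract_read A00755:303:HVG75DSX3:2:1101:3179:33802 trimmed/S471.R1.trim.fq.gz > crash_R1.fq
#   extract_read A00755:303:HVG75DSX3:2:1101:3179:33802 trimmed/S471.R2.trim.fq.gz > crash_R2.fq
```

Mapping only the extracted pair against the same indexes gives a much smaller test case to attach to the issue.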
