Hi VG team,
I’m working on constructing a pangenome graph for Anelloviridae. My goal is the following:
1. Circularize each individual genome (using vg circularize).
2. Embed an ORF1 path in each graph via vg map + vg augment.
3. Combine all those augmented graphs into a single unified reference (vg combine), and then:
4. Map metagenomic reads against this global reference.
However, I’m running into several issues during the process. Here’s a simplified summary of the commands I’m using:
Graph creation per sample (in loop):
vg construct -r sample.fasta > sample.vg
vg paths -Lv sample.vg > sample.paths
vg circularize sample.vg -P sample.paths > sample.circ.vg
vg index -x sample.circ.xg sample.circ.vg
vg gbwt -x sample.circ.xg -o sample.circ.gbwt -P --pass-paths
vg index -g sample.circ.gcsa sample.circ.xg
vg convert -p sample.circ.xg > sample.circ.pg
vg map -F ORF1.fasta -x sample.circ.xg -g sample.circ.gcsa -m long > sample.orf1.gam
vg augment -i sample.circ.pg sample.orf1.gam > sample_wORF1.vg
Then I sanitize and prefix all paths and run:
vg combine *_wORF1_prefixed.gfa > combined.vg
vg convert -f combined.vg > combined.gfa
gfaffix combined.gfa -o combined.fix.gfa
vg convert -f combined.fix.gfa -p > combined.pg
vg index -x combined.xg combined.pg
vg gbwt -x combined.xg -o combined.gbwt -P --pass-paths
vg prune -u -g combined.gbwt -k 31 -m combined.node_mapping combined.pg > combined.pruned.vg
vg index -g combined.gcsa -f combined.node_mapping combined.pruned.vg
In some cases, I get errors like:
InputGraph::InputGraph(): Cannot open node mapping file All_anelloviridae_2.node_mapping
I think it cannot find the paths in the graph.
In other cases, I get an error about a duplicate path name found in the graph, even after prefixing each path with the sample name before running vg combine.
I’ve confirmed that:
• Path names are unique (I overwrite with sample_ prefix using awk).
• Each *_wORF1.vg has a proper embedded path.
• The ORF1 is always mapped and augmented correctly.
Questions:
• Is this the correct sequence for building a circular multi-sample graph with embedded ORF1 paths?
• Should I embed ORF1 before or after circularization?
• Does vg combine preserve embedded paths in a way that vg gbwt can consume safely?
• Is there a recommended way to ensure vg gbwt doesn’t complain about duplicate or missing paths after vg combine?
Any suggestions or clarifications would be appreciated. I’m happy to provide example files if useful.
Thanks,
Flor
@florenmartino That error is caused by a simple failure to open that file (which is created in the vg prune step). Usually, that results from a typo in the file name or from inadequate permissions to read the file, although there are more exotic causes as well.
Using your strategy of mapping in the ORFs, I think you are right to do it after circularizing. Otherwise you won't be able to align ORFs that cross the 0 position on the FASTA sequence.
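On the node mapping error specifically, one way to rule out the common causes (wrong name, unreadable file, or an empty file left behind by a failed vg prune) is a small check run just before vg index. This is only a sketch: check_readable is a hypothetical helper, not part of vg, and the file name in the usage line is taken from the commands in the question.

```shell
# Hypothetical helper (not a vg command): report whether a file exists,
# is readable by the current user, and is non-empty.
check_readable() {
  f=$1
  if [ ! -e "$f" ]; then
    echo "missing: $f"
    return 1
  fi
  if [ ! -r "$f" ]; then
    echo "not readable: $f"
    return 1
  fi
  if [ ! -s "$f" ]; then
    echo "empty: $f"
    return 1
  fi
  echo "ok: $f"
}
```

For example, running check_readable combined.node_mapping in the same directory and shell session as the vg index command shows whether the name passed to -f matches what vg prune -m actually wrote.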
Regarding the path names and vg combine, I don't think there should be any issue as long as the path names are unique. Did you also check to ensure that the sequence names in the FASTAs are unique?
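A quick way to check both kinds of names is sketched below. It assumes the FASTA sequence name is the first whitespace-delimited word of each header line, and that the paths in the prefixed GFA files are stored as P lines (W lines, if present, are not covered). Both function names are hypothetical helpers, not vg commands.

```shell
# Hypothetical helper: print FASTA sequence names (first word of each
# header) that occur more than once across the given files.
dup_fasta_names() {
  grep -h '^>' "$@" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

# Hypothetical helper: print path names from GFA P lines (second
# tab-separated field) that occur more than once across the given files.
dup_gfa_paths() {
  awk -F'\t' '$1 == "P" {print $2}' "$@" | sort | uniq -d
}
```

Running dup_fasta_names on all input FASTAs and dup_gfa_paths on all *_wORF1_prefixed.gfa files should print nothing if every name is unique. Any name that is printed is a candidate for the duplicate path error.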