Lack of correspondance of GFA node IDs to giraffe/call node IDs
3
0
Entering edit mode
6 months ago

I have a GFA graph built with PGGB using several samples. I want to genotype some other samples with short reads using VG Giraffe.

After investigating how to generate the corresponding indexes for VG Giraffe from a GFA generated with PGGB, I think I have found a way to do it. However, the node IDs obtained after VG call do not correspond to the node IDs of the original GFA. Is there any alternative way to the one I did to solve this problem?

Here how I created the indexes and how I used VG giraffe and VG call:

vg autoindex --workflow giraffe -p prefix -g pggb.gfa -R XG

vg giraffe -Z pggb.giraffe.gbz -x pggb.xg -f reads1_1.fastq -f reads1_2.fastq > graph.gam

vg pack -x pggb.xg -g graph.gam -o graph.pack

vg call pggb.xg -k graph.pack -p reference_path > graph.vcf

Here the vcf generated after VG Call:

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1     119     >4>6    AG      A       319.823 PASS    AT=>4>5>6,>4>6;DP=14    GT:DP:AD:GL:GQ:GP:XD:MAD        1/1:14:0,14:-33.191784,-5.222199,-1.686644:35:-1.098904:14.409575:14
chr1     3881    >123>126        T       A       731.864 PASS    AT=>123>124>126,>123>125>126;DP=32      GT:DP:AD:GL:GQ:GP:XD:MAD        1/1:32:0,32:-74.786923,-10.856443,-2.077491:87:-1.098612:31.617285:32
chr1     9995    >317>320        A       T       1098.81 PASS    AT=>317>318>320,>317>319>320;DP=48      GT:DP:AD:GL:GQ:GP:XD:MAD        1/1:48:0,48:-111.791357,-15.895637,-2.387109:135:-1.098612:43.424244:48

And here the vcf generated with VG deconstruct from the gfa graph (note that the positions are the same but the node IDs are different).

CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
Chr          1119     >1>3    AG      A       60      .       AT=>1>2>3,>1>3
Chr1        3881    >3>6    T       A       60      .       AT=>3>4>6,>3>5>6
Chr1        9995    >6>9    A       T       60      .       AT=>6>7>9,>6>8>9

Thanks in advance

giraffe pggb vg • 777 views
ADD COMMENT
0
Entering edit mode
6 months ago
glenn.hickey ▴ 520

The issue is likely that .gbz creation chops your nodes down to 1024bp. This is described in more detail here (and will apply to any graph, not just minigraph-cactus):

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#node-chopping

In theory, the .gbz keeps a table mapping from the new to the old IDs. vg call can use this table to output in terms of the original IDS with the vg call -O option. I'm not sure why this doesn't appear in the vg call help (a bug to be fixed), but looking at the code it looks like it should work.

tldr try adding -O to your vg call invocation.

ADD COMMENT
0
Entering edit mode

I'll also add that you should not be using the -R option in vg autoindex. There's a reason that this option is not included in the help output for vg autoindex. We don't maintain stable behavior for it, and the results can be unpredictable unless you are familiar with the inner workings of vg autoindex. It's really more of a developer tool. If you want an XG file that matches your GBZ, you should use vg convert -x. In addition, most VG commands whose help documentation says xg actually accept other graph formats (including GBZ) as well. The "XG" language is mostly legacy from when the formats were more rigidly enforced.

ADD REPLY
0
Entering edit mode

Thanks a lot for your help. In fact, the main problem is that when converting the GFA file from pggb to GBZ format with vg autoindex, I lose the reference paths that I had in the GFA. That's why I included the XG file to try to solve the problem.

That is, when I type vg paths -Lx graph.gfa, I get something like this, that correspond to my original assemblies:

DB#1#vat#0
AN#1#vat#0

But when I type vg paths -Lx graph.gbz, I get this output that I can't understand:

0#0#0#0
1#0#0#0
2#0#0#0
3#0#0#0
4#0#0#0
5#0#0#0
6#0#0#0
7#0#0#0
8#0#0#0
9#0#0#0
10#0#0#0
11#0#0#0
12#0#0#0
13#0#0#0
14#0#0#0
15#0#0#0

There is any way to keep the reference paths when converting from GFA to GBZ?

ADD REPLY
0
Entering edit mode
6 months ago
glenn.hickey ▴ 520

One way is to add a "RS" (Reference Sample) tag to your GFA header

H   VN:Z:1.1    RS:Z:<sample1> <sample2> etc

where the samples are space-separated. This way any path of the form <sample1>#0#chr1 etc will end up as a reference path in the GBZ.

Note that the GFA fields are tab-separated, but the sample names are space-separated.

ADD COMMENT
0
Entering edit mode
5 months ago
Wenhai • 0

I think you should use the command vg giraffe --named-coordinates, it can preserve the node IDs from original GFA file in graph.gam.

ADD COMMENT

Login before adding your answer.

Traffic: 2805 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6