Question

Creating a variation graph for Giraffe alignment from assemblies

2.7 years ago

Michael ▴ 30

I have a collection of ~100 haploid assemblies, each about 4.5 Mb, that I would like to map reads to using Giraffe. However, I am not clear on the best practices for constructing the graph from the assemblies. I have used PGGB to create a GFA with haplotype information, but according to the wiki and previous Biostars responses, vg autoindex --giraffe only works from a VCF + reference FASTA and does not currently support a GFA with haplotype information.

I have considered a few options:

1. Manually create all of the indexes for Giraffe using the commands found here: https://github.com/vgteam/vg/wiki/Index-Types
2. Use vg deconstruct to create a VCF containing all variation in the PGGB GFA graph relative to one reference, and then use that VCF + ref FASTA to run vg autoindex --giraffe.
3. Use an alternative method to create a VCF from assemblies, although I am not sure which method would be best for this.

I would appreciate any advice on which of these options is best, or any general advice on best practice when constructing graphs directly from haploid assemblies.

vg vgteam

updated 2.7 years ago by Jouni Sirén ▴ 360 • written 2.7 years ago by Michael ▴ 30

score 3 · Accepted Answer · 2021-08-12

The best practices are still evolving, because there is no standard way of representing a graph that contains both reference paths and haplotype paths. Giraffe itself does not care about the distinction, but you have to specify the reference paths if you want to use downstream tools based on linear references.

We prefer using GFA P-lines for the reference paths and W-lines (from a pull request that has not been merged into the GFA specification) for haplotype paths. I believe PGGB uses P-lines for everything.

We skipped a vg release due to summer vacations, but there should be a new release next week (~August 16). That release will include many improvements to using GFA graphs with Giraffe. If the GFA contains W-lines, vg autoindex will detect them and build Giraffe indexes using both the reference paths and the haplotypes. Otherwise you have to build the indexes manually. You can convert the GFA into GBZ format (GBWT + GBWTGraph) with vg gbwt, build a distance index for the GBZ with vg snarls and vg index, and then build the minimizer index from the GBZ and the distance index with vg minimizer. (Passing the distance index to minimizer index construction is important, because Giraffe is much slower without cached distance information.)
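
As a rough sketch of the manual route, the four steps can be assembled with a small Python helper. The file names (graph.gfa, graph.gbz and so on) are placeholders, and the flags are assumptions based on the vg command line around this release rather than a guaranteed interface, so check each one against the subcommand's --help output for your version. The helper only prints the commands unless dry_run=False is passed.

```python
import shlex
import subprocess


def giraffe_index_commands(gfa, prefix):
    """Return the manual indexing steps as (argv, stdout_path) pairs.

    All flags are assumptions about the vg CLI of this era; verify them
    with `vg <subcommand> --help` before running anything.
    """
    gbz = f"{prefix}.gbz"
    snarls = f"{prefix}.snarls"
    dist = f"{prefix}.dist"
    minimizer = f"{prefix}.min"
    return [
        # 1. GFA -> GBZ (GBWT + GBWTGraph in a single file).
        (["vg", "gbwt", "-G", gfa, "--gbz-format", "-g", gbz], None),
        # 2. Find snarls; vg snarls writes its result to stdout.
        (["vg", "snarls", "-T", gbz], snarls),
        # 3. Build the distance index from the snarls.
        (["vg", "index", "-s", snarls, "-j", dist, gbz], None),
        # 4. Build the minimizer index, passing the distance index so that
        #    distance information is cached for Giraffe.
        (["vg", "minimizer", "-o", minimizer, "-d", dist, gbz], None),
    ]


def run_pipeline(steps, dry_run=True):
    """Format each step as a shell line; execute it only if dry_run is False."""
    lines = []
    for argv, stdout_path in steps:
        line = shlex.join(argv)
        if stdout_path is not None:
            line += " > " + shlex.quote(stdout_path)
        lines.append(line)
        if dry_run:
            continue
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
    return lines


if __name__ == "__main__":
    for line in run_pipeline(giraffe_index_commands("graph.gfa", "graph")):
        print(line)
```

The order matters: the snarls feed the distance index, and the distance index feeds the minimizer index, so a failure at any step should stop the run (hence check=True).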

If you need reference paths for downstream analysis, things become a bit more difficult. If you started with a GFA with both P-lines and W-lines, you can simply convert the GBZ into XG with vg convert. Otherwise you have to parse GBWT metadata from GFA path names during GBZ construction, which requires that the paths be named consistently. GBWT path names consist of four components: sample name, contig name, haplotype id, and fragment id. Each name must be unique, but if you leave the fragment id unspecified, it will be used for disambiguating between otherwise identical names. Reference paths must all have the same sample name and distinct contig names (which become path names in the XG graph). In this case, you must specify the reference sample name with the vg convert option --ref-sample.
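
To make the naming constraints concrete, here is a minimal Python sketch of the four-component model. It assumes a hypothetical sample#haplotype#contig naming scheme for the GFA paths, and the way it numbers fragments is an illustration of the disambiguation idea, not a description of how vg does it internally. The names PathName, parse_path_name, assign_fragments and reference_contigs are made up for this example.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class PathName(NamedTuple):
    sample: str
    contig: str
    haplotype: int
    fragment: Optional[int]  # None means "left unspecified"


# Assumed naming scheme for this sketch: sample#haplotype#contig
PATTERN = re.compile(r"^(?P<sample>[^#]+)#(?P<haplotype>\d+)#(?P<contig>[^#]+)$")


def parse_path_name(name):
    """Split a path name into components, leaving the fragment unspecified."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"path name does not follow the scheme: {name!r}")
    return PathName(match["sample"], match["contig"], int(match["haplotype"]), None)


def assign_fragments(paths):
    """Use unspecified fragment ids to make otherwise identical names unique."""
    seen = Counter()
    result = []
    for path in paths:
        if path.fragment is None:
            key = (path.sample, path.contig, path.haplotype)
            path = path._replace(fragment=seen[key])
            seen[key] += 1
        result.append(path)
    return result


def reference_contigs(paths, ref_sample):
    """Return the reference contig names, checking that they are distinct."""
    contigs = [p.contig for p in paths if p.sample == ref_sample]
    duplicates = [c for c, n in Counter(contigs).items() if n > 1]
    if duplicates:
        raise ValueError(f"reference contig names are not distinct: {duplicates}")
    return contigs


if __name__ == "__main__":
    names = ["ref#0#chr1", "ref#0#chr2", "s1#1#ctgA", "s1#1#ctgA"]
    paths = assign_fragments([parse_path_name(n) for n in names])
    for path in paths:
        print(path)
    print(reference_contigs(paths, "ref"))
```

Running a check like this over the P-line names before GBZ construction is a cheap way to find paths that do not fit the scheme, or a reference sample whose contig names collide.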