Creating a variation graph for Giraffe alignment from assemblies
17 months ago
Michael • 0

I have a collection of ~100 haploid assemblies (about 4.5 megabases each) that I would like to map reads to with Giraffe. However, I am not clear on the best practices for constructing the graph from the assemblies. I have used PGGB to create a GFA with haplotype information, but according to the wiki and previous Biostars responses, vg autoindex --giraffe only works from a VCF plus a reference FASTA and does not currently support a GFA with haplotype information.

I have considered a few options:

1. Manually create all of the indexes for giraffe using the commands found here: https://github.com/vgteam/vg/wiki/Index-Types
2. Use vg deconstruct to create a VCF containing all variation in the PGGB GFA graph relative to one reference, and then use that VCF + ref FASTA to run vg autoindex --giraffe.
3. Use an alternative method to create a VCF from assemblies, although I am not sure which method would be best for this.

I would appreciate any advice on which of these options is best, or any general advice on best practice when constructing graphs directly from haploid assemblies.

17 months ago
Jouni Sirén ▴ 300

The best practices are still evolving, because there is no standard way of representing a graph that contains both reference paths and haplotype paths. Giraffe itself does not care about the distinction, but you have to specify the reference paths if you want to use downstream tools based on linear references.

We prefer using GFA P-lines for the reference paths and W-lines (from a pull request that has not been merged into the GFA specification) for haplotype paths. I believe PGGB uses P-lines for everything.

We skipped a VG release due to summer vacations, but there should be a new release next week (~August 16). That release will include many improvements to using GFA graphs with Giraffe. If the GFA contains W-lines, vg autoindex will detect it and build Giraffe indexes using both the reference paths and the haplotypes. Otherwise you have to build the indexes manually. You can convert the GFA into GBZ format (GBWT + GBWTGraph) with vg gbwt, build a distance index for the GBZ with vg snarls and vg index, and then build the minimizer index for the GBZ and the distance index with vg minimizer. (Using the distance index for minimizer index construction is important, because Giraffe is much slower without cached distance information.)
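The manual steps above can be sketched as follows. This is a small Python helper that only prints the command lines and does not run vg. The file names are hypothetical, and the option names (`-G`, `--gbz-format`, `-g`, `-T`, `-s`, `-j`, `-d`, `-o`) are an assumption based on the vg wiki of that period, so check each one against `vg <subcommand> --help` for your version before running anything.

```python
import shlex


def giraffe_index_commands(gfa, prefix):
    """Return the manual Giraffe indexing steps as shell command strings.

    Nothing is executed here. Flag names are assumptions taken from the
    vg wiki and may differ between vg versions.
    """
    q = shlex.quote
    gbz = f"{prefix}.gbz"
    snarls = f"{prefix}.snarls"
    dist = f"{prefix}.dist"
    minimizer = f"{prefix}.min"
    return [
        # 1. Convert the GFA into GBZ format (GBWT + GBWTGraph).
        f"vg gbwt -G {q(gfa)} --gbz-format -g {q(gbz)}",
        # 2. Find snarls in the GBZ graph (written to stdout).
        f"vg snarls -T {q(gbz)} > {q(snarls)}",
        # 3. Build the distance index from the snarls.
        f"vg index -s {q(snarls)} -j {q(dist)} {q(gbz)}",
        # 4. Build the minimizer index, passing the distance index so that
        #    distance information can be cached in it.
        f"vg minimizer -d {q(dist)} -o {q(minimizer)} {q(gbz)}",
    ]


# Hypothetical input and output names.
for command in giraffe_index_commands("pangenome.gfa", "pangenome"):
    print(command)
```

Printing the commands instead of running them makes it easy to review the flags first and paste the lines into a shell or a workflow file afterwards.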

If you need reference paths for downstream analysis, things become a bit more difficult. If you started with a GFA with both P-lines and W-lines, you can simply convert the GBZ into XG with vg convert. Otherwise you have to parse GBWT metadata from GFA path names during GBZ construction, which requires that the paths be named consistently. GBWT path names consist of four components: sample name, contig name, haplotype id, and fragment id. Each name must be unique, but if you leave the fragment id unspecified, it will be used for disambiguating between otherwise identical names. Reference paths must all have the same sample name and distinct contig names (which become path names in the XG graph). In this case, you must specify the reference sample name with vg convert option --ref-sample.
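To make the four name components concrete, here is a minimal Python sketch of the kind of parsing involved. It assumes the GFA path names follow a `sample#haplotype#contig` pattern, which is an assumption about your PGGB input and not something vg requires. The regular expression is plain Python, not vg's own option syntax (vg gbwt has its own regex options, which I believe are `--path-regex` and `--path-fields`; check `vg gbwt --help`). The fragment numbering below is only an illustration of disambiguation, not necessarily the exact scheme GBWT uses.

```python
import re
from typing import NamedTuple, Optional


class PathName(NamedTuple):
    """The four GBWT path name components described above."""
    sample: str
    contig: str
    haplotype: int
    fragment: Optional[int]


# Assumed naming scheme: sample#haplotype#contig (hypothetical for your data).
NAME_PATTERN = re.compile(r"([^#]+)#(\d+)#(.+)")


def parse_path_name(name):
    """Split one GFA path name into sample, contig and haplotype."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"path name does not match the pattern: {name!r}")
    sample, haplotype, contig = match.groups()
    return PathName(sample, contig, int(haplotype), None)


def assign_fragments(names):
    """Give otherwise identical names increasing fragment ids."""
    seen = {}
    parsed = []
    for name in names:
        path = parse_path_name(name)
        key = path[:3]  # (sample, contig, haplotype)
        fragment = seen.get(key, 0)
        seen[key] = fragment + 1
        parsed.append(path._replace(fragment=fragment))
    return parsed


# Hypothetical path names for two haploid assemblies.
for path in assign_fragments(["strainA#1#chr1", "strainB#1#chr1"]):
    print(path)
```

If a name fails to parse, the sketch raises an error instead of guessing, which is usually what you want when checking that all paths are named consistently.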


Thank you for the help! This is very useful. I am a little confused about the terminology for paths though. Each of the reference sequences used to construct the graph is a haploid assembly from a homozygous cell line, so each reference sequence path is also a full haplotype path. Does this mean my P-lines (reference paths) should be duplicated as W-lines (haplotype paths) if I want to use the W-lines feature with GFA? Or do haplotype paths have their own distinct meaning here?


A reference path in VG terminology is a path that provides a coordinate system. If you want to use downstream tools based on linear sequences, you have to project the alignments from the graph to a reference sequence. It is often assumed that there is one reference path in each graph component.

Haplotype paths are additional paths that inform some VG algorithms which alignments are likely to be true.

We usually assume that reference paths are synthetic sequences, while haplotype paths are true haplotypes. If you have only true haplotypes, they should all be either P-lines or W-lines. With P-lines, you have to use regular expressions for parsing GBWT path names from GFA path names. With W-lines, the required fields already correspond closely to the GBWT path name components. If you are using only one line type, GBWT does not know which set of paths is supposed to be the reference. Hence you have to specify --ref-sample when converting GBZ to XG, regardless of whether you are using P-lines or W-lines.
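To show how closely the W-line fields line up with the name components, here is a minimal Python sketch that rewrites one P-line as a W-line. It assumes the GFA 1.1 layout `W sample haplotype contig start end walk` and leaves the start and end positions as `*`, which the specification allows but some parsers may not accept. The sample, haplotype, and contig values are supplied by the caller and are hypothetical here.

```python
def p_line_to_w_line(p_line, sample, haplotype, contig):
    """Rewrite a GFA P-line as a W-line with the given name components.

    Start and end positions are written as '*' (unspecified); this is an
    assumption and real tools may require integer coordinates.
    """
    fields = p_line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "P":
        raise ValueError("not a P-line")
    walk = []
    for step in fields[2].split(","):
        segment, orientation = step[:-1], step[-1]
        if orientation not in "+-" or not segment:
            raise ValueError(f"bad path step: {step!r}")
        # P-lines use 'id+' / 'id-'; W-lines use '>id' / '<id'.
        walk.append((">" if orientation == "+" else "<") + segment)
    return "\t".join(["W", sample, str(haplotype), contig, "*", "*", "".join(walk)])


# Hypothetical P-line for one haploid assembly.
print(p_line_to_w_line("P\tstrainA#1#chr1\t1+,2-,3+\t*", "strainA", 1, "chr1"))
```

With P-lines the sample, haplotype, and contig have to be recovered from the single path name, whereas the W-line carries them as separate columns.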