Is there any way to associate the path names in a vg graph with sample IDs? Details: the vg graph is generated from variants in a VCF file containing multiple samples. How can I tell which variant source (e.g. sample ID) each path in the vg graph is associated or tagged with? Thanks.
VG assumes that path names are opaque strings. While some path names starting with `_` (e.g. `_alt_*` and `_thread_*`) are used for technical purposes, VG generally does not understand the information encoded in path names.
In VG terminology, there is a conceptual difference between paths and threads:
- Paths are defined simultaneously as node sequences and nucleotide sequences. They are stored in the graph itself, and most graph implementations support random access within the paths. Storing many paths generally requires a large amount of space.
- Threads are lightweight paths that are only defined as node sequences. They are stored in a GBWT index, which only supports sequential access to the threads. If the threads are similar enough, they can be stored very space-efficiently.
Unlike path names, thread names are structured. They consist of four fields: sample name, contig name, phase identifier, and running count / fragment identifier. If multiple contig names are used, many VG algorithms assume that contig names match the names of the paths embedded in the graph.
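To make the four-field structure concrete, here is a minimal Python sketch of a thread name as a record, plus a helper that groups threads by sample. The class name `ThreadName`, the helper `threads_by_sample`, and the sample and contig values are all hypothetical, chosen for illustration. This is not VG's API, and it does not show how VG or the GBWT actually serializes thread names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadName:
    """Hypothetical record mirroring the four fields of a thread name."""
    sample: str   # sample name, e.g. a sample column from the VCF
    contig: str   # contig name, expected to match a path name in the graph
    phase: int    # phase identifier (which haplotype of the sample)
    count: int    # running count / fragment identifier


def threads_by_sample(names):
    """Group thread names by their sample field."""
    grouped = defaultdict(list)
    for name in names:
        grouped[name.sample].append(name)
    return dict(grouped)


# Illustrative values only; real names depend on your VCF and reference.
names = [
    ThreadName("NA12878", "chr1", 0, 0),
    ThreadName("NA12878", "chr1", 1, 0),
    ThreadName("HG002", "chr1", 0, 0),
]
by_sample = threads_by_sample(names)
```

Because the sample name is a separate field, finding the haplotypes that belong to one sample is a simple lookup rather than a matter of parsing an opaque string.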
Graphs built with `vg construct` ignore sample information. However, if option `-a` is used during construction, variants will be stored as paths in the graph (using the `_alt_` prefix for the names). With these alt paths and the VCF files, you can then generate threads for the samples and store them in a GBWT index with the `vg index` subcommand. There is some documentation in the vg wiki.
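The two steps might look roughly like the sketch below, written as Python argument lists so they can be printed or handed to `subprocess.run`. Only the `-a` option and the `vg construct` / `vg index` subcommands come from the answer above. The other flags (`-r`, `-v`, `-x`, `-G`) are assumptions based on older vg releases and may differ in your version, so check `vg construct --help` and `vg index --help` first. All file names and the helper `gbwt_build_commands` are hypothetical.

```python
import shlex


def gbwt_build_commands(reference, vcf, graph, xg, gbwt):
    """Return (construct, index) argument lists; flags other than -a are assumed."""
    # Step 1: build the graph and keep variants as _alt_ paths (-a).
    # Assumed: -r gives the reference FASTA and -v the VCF.
    construct = ["vg", "construct", "-r", reference, "-v", vcf, "-a"]
    # Step 2: generate threads from the VCF and store them in a GBWT index.
    # Assumed: -x names the XG index, -G the GBWT output, -v the phased VCF.
    index = ["vg", "index", "-x", xg, "-G", gbwt, "-v", vcf, graph]
    return construct, index


construct_cmd, index_cmd = gbwt_build_commands(
    "ref.fa", "samples.vcf.gz", "graph.vg", "graph.xg", "graph.gbwt"
)

# The graph is assumed to be written to standard output, hence the redirect.
print(shlex.join(construct_cmd), ">", "graph.vg")
print(shlex.join(index_cmd))
```

The sketch only prints the commands rather than running them, which makes it easy to compare against the help output of your installed vg before executing anything.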