In 'vg construct', is it possible to have the variant paths be labeled with sample ID?
1
0
Entering edit mode
10 months ago
egoltsman • 0

Hello vg gurus, When inducing a graph from a vcf, I'd like to have the sample-threads show up in the graph as paths instead of it being the individual variant ids (the latter seems to be the default behavior with 'vg construct -a'). It would be very useful to know which segments were induced by which sample(s). If this is not possible to do with 'construct', is there another way to add sample-coherent threads as paths? Mapping the sequences back to the graph, one sample at a time, and then assigning the paths based on the mapping seems to be too error-prone. It seems that getting the paths at the time of the initial graph construction/augmentation would be a more natural way of doing it.
Thanks!

vg • 444 views
0
Entering edit mode
10 months ago
Jouni Sirén ▴ 190

The original sequences can be stored as threads (lightweight paths) in a GBWT index. See the index construction wiki page for details on building the GBWT. Storing a large number of paths in the graph itself is usually not a good idea, because the paths are not very space-efficient.

If you want to extract the threads for a specific sample and add them as paths to the graph, you can do it with the vg paths and vg augment subcommands:

vg paths -v graph.vg -g graph.gbwt -S <sample> -X > <sample>.gam
vg augment -B graph.vg <sample>.gam > augmented.vg


The new paths will have names of the form _thread_<sample>_<contig>_<phase>_<fragment>.

0
Entering edit mode

Thank you very much, I'll try that.

0
Entering edit mode

Sorry, I'm not sure I fully understand the index construction page and so I wasn't able to get to the sample thread(s) aside from the "reference" thread that's already stored in the graph. To clarify, the vcf file I'm inducing the graph from is pretty minimal and doesn't have haplotype info. Here's an example:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Bd1_1_CHR1 Ref_CHR1 61293 8203 N <del> 6975.37 . SVTYPE=DEL;SVLEN=-1965;END=63258;STRANDS=+-:708;CIPOS=-2,2;CIEND=-1,2;CIPOS95=0,0;CIEND95=0,0;SU=708;PE=514;SR=194;SNAME=Bd1_1_CHR1:1;AC=59 GT 1/1

I induced the graph with the following command: vg construct -f -S -m 1024 -p -r BdistachyonRef_CHR1.fa -v foo.vcf > graph.vg

When I run vg path -E -v graph.vg I get what looks like individual variant traversal paths, plus the reference sequence path which spans the length of the graph:

_alt_24084deba4d30ec586cc795c6023dd3628692000_0 3203 _alt_cb1c6e16d06bd350a36005792afc62eb43dfae2e_1 1428 _alt_80f10350f6ed0942d6e5669440b7983b341177a0_1 0 _alt_a4b2aca0b177f9fcf1239a48c5d361417958fc8f_0 1326 _alt_5b6b2d7421e4dbd5797f364c0a8ff3bf3a4b3a6e_0 4116 _alt_03e6014294926b896eed0da6be10b5c4313dc643_0 6606 _alt_7bf24001ce66c7a603a816557d25256c7df93050_1 0 _alt_d004a29cb973ef03b575201624a08320c391c0ca_0 1756 _alt_357846b6c19f1b036eb2eb1b4b69d994b98d9581_0 3994 _alt_eeed22b0bddfc7e621cd07deb8915bc54ae4ff55_0 5057 Ref_CHR1 75071545

The thread/path I'm interested in is for the sample Bd1_1_CHR1. I imagine that before a thread can be stored in any index, the original vg file must capture it in some way, but in this case it didn't. Btw, I'm runnin vg version 1.28

Thanks again