Question

Question about the various graph formats: how do I get the sequence and source of a specific node?

14 months ago

chuanj8848 • 0

.gfa is a text-based format that describes the structure of a pan-genome graph. I can write a script to parse such a file, but that is time-consuming because the files are large.

However, vg uses several other formats, for example .gbz, .vg, and .xg. These files are all binary, so I can't easily tell what information they contain or what can be extracted from them.

Is there a way to get the sequence and source of a specific node/segment? By source I mean, for example, which haplotypes contain the node.

vg • 1.4k views

12 months ago by chuanj8848 • 0

vg convert can convert those formats into GFA, and vg chunk can be used to query small graph regions. However, vg chunk loads the entire graph into memory for each query. This makes it fast enough for individual interactive queries, but too slow to be very effective as a backend to programmatic queries. There's development currently underway on a more responsive SQL-based query interface here.

14 months ago by Jordan M Eizenga ▴ 760
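
Once a graph is available as GFA, a single streaming pass is usually enough to answer the original question for one node, without building the whole graph in memory. The sketch below is only an illustration, not part of vg: the function name node_sequence_and_sources is made up, and it assumes a GFA 1.0/1.1 file in which the node sequence sits on an S line and haplotype membership is recorded on P (path) or W (walk) lines.

```python
import re
from typing import Iterable, Optional, Set, Tuple

# A W-line walk looks like ">1<2>3": an orientation character, then a segment name.
_WALK_STEP = re.compile(r"[<>]([^<>]+)")


def node_sequence_and_sources(
    lines: Iterable[str], node_id: str
) -> Tuple[Optional[str], Set[str]]:
    """Return (sequence, sources) for one node from GFA lines.

    Hypothetical helper: 'sources' are the names of the P-line paths and
    W-line walks (written as sample#haplotype#contig) that visit the node.
    """
    sequence: Optional[str] = None
    sources: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag = fields[0]
        if tag == "S" and len(fields) >= 3 and fields[1] == node_id:
            # S <name> <sequence>; the sequence may be "*" if it is not stored.
            sequence = fields[2]
        elif tag == "P" and len(fields) >= 3:
            # P <path name> <seg1+,seg2-,...>; drop the trailing orientation sign.
            steps = {step[:-1] for step in fields[2].split(",")}
            if node_id in steps:
                sources.add(fields[1])
        elif tag == "W" and len(fields) >= 7:
            # W <sample> <hap index> <seq id> <start> <end> <walk>
            steps = set(_WALK_STEP.findall(fields[6]))
            if node_id in steps:
                sources.add(f"{fields[1]}#{fields[2]}#{fields[3]}")
    return sequence, sources
```

Because the function takes any iterable of lines, it can be fed an open file handle directly, e.g. `with open("graph.gfa") as fh: node_sequence_and_sources(fh, "42")` (the file name and node ID here are placeholders). It still reads the whole file once per call, so for many nodes it would be better to collect all of them in the same pass.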

score 0 · Answer 1 · 2024-10-11

Hi,

I had a similar requirement before: I needed to determine which samples contained specific nodes in a GFA file. The tool I wrote for this is on GitHub: https://github.com/zhangyixing3/pantools

Fri Oct 11 17:50:01 stu_zhangyixing c050~
$ head nodes.20row 
1
2
3
4
5
6
7
8
9
10

Fri Oct 11 17:50:24 stu_zhangyixing c050~
$ gfar pav -g DRB1-3123.w.gfa -n nodes.20row -o testtt
2024/10/11 17:57 [DEBUG]pav.rs:12   GFA file parsed successfully
2024/10/11 17:57 [DEBUG]pav.rs:20   The number of nodes to be analyzed is: 20
2024/10/11 17:57 [DEBUG]pav.rs:29   total number of samples: 12
Done!, gfar version 0.1
CMD: gfar pav -g DRB1-3123.w.gfa -n nodes.20row -o testtt
Real time: 0 sec; CPU: 0 sec; Peak RSS: 0.004 GB


Fri Oct 11 17:57:07 stu_zhangyixing c050~
$ head testtt 
node    sample10    sample9 sample11    sample7 sample12    sample8 sample1 sample3 sample4 sample2 sample6 sample5
19  1   0   0   0   0   0   0   0   0   0   1   0
20  0   1   1   1   0   1   1   0   0   1   0   1
9   0   0   0   0   1   0   0   1   1   0   0   0
13  1   1   1   1   0   1   1   0   0   1   1   1
17  0   1   1   1   0   1   1   0   0   1   0   1
7   1   0   0   0   0   0   0   0   0   0   1   0
11  0   0   0   0   1   0   0   1   1   0   0   0
15  1   0   0   0   0   0   0   0   0   0   1   0
5   1   1   1   0   0   1   1   0   0   1   1   1
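
To go from a table like this to "which samples contain node N", a few lines of Python are enough. This is a minimal sketch with a made-up function name (samples_with_node). It assumes the layout shown above: a whitespace-separated header starting with "node", then one row per node. It also assumes that 1 means the node is present in that sample and 0 means absent, which the output suggests but does not state.

```python
from typing import Iterable, List


def samples_with_node(table_lines: Iterable[str], node_id: str) -> List[str]:
    """List the samples whose column holds "1" in the row for node_id.

    Hypothetical helper; assumes 1 = present and 0 = absent.
    """
    rows = iter(table_lines)
    # First line is the header: "node" followed by the sample names.
    samples = next(rows, "").split()[1:]
    for row in rows:
        fields = row.split()
        if fields and fields[0] == node_id:
            return [s for s, flag in zip(samples, fields[1:]) if flag == "1"]
    return []  # node not found in the table
```

For the output file above this could be called as `with open("testtt") as fh: samples_with_node(fh, "19")`.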