I have two GFA files representing the same dataset, and I am trying to map reads to each of them using Giraffe. One of them is much smaller than the other (17 MB vs 52 MB). However, the distance index computation (and hence autoindex) is much faster on the larger graph: it took only a couple of seconds there, while it has been running for several hours on the smaller one.
I want to know why this is the case and how it can be fixed. I have attached a Google Drive link containing the two datasets with this post.
The command I am using is: vg autoindex --workflow giraffe -g <file_name.gfa> -p <output_file_name>
Here are the two GFA files: https://drive.google.com/drive/folders/1mCmgIuVTDthDS7h5iW0PY86-dkFYGCDG?usp=sharing Indexing is very fast for the large one and extremely slow for the small one.
The distance index's efficiency is determined to a large extent by the complexity of the graph. It tends to work best when the graph looks mostly like a series of "bubbles". If the graph has a much more complicated topology, the index can require a lot of computation to create, and it typically also ends up being quite large. I think it's likely that the reason the one graph is smaller is because it has merged more distant paralogous sequences, leading to a complicated topology, which makes it less amenable to vg giraffe's indexing strategies.
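One rough way to check this on your two files is to compare simple topology statistics taken straight from the GFA text. The Python sketch below is only an illustration: the function name summarize_gfa and the statistics it reports are my own choices, not anything vg computes, and it assumes tab-separated GFA 1 with S (segment) and L (link) lines. A graph with far more links per segment, or with some very high-degree segments, is probably more tangled than a simple chain of bubbles, although these numbers are only a proxy and will not predict indexing time exactly.

```python
from collections import Counter


def summarize_gfa(lines):
    """Return rough topology statistics for GFA 1 text (an iterable of lines).

    Hypothetical helper for illustration; not part of vg.
    """
    segment_lengths = {}   # segment name -> sequence length
    degree = Counter()     # segment name -> number of link endpoints on it
    n_links = 0

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # S <name> <sequence>; "*" means the sequence is not stored
            seq = fields[2]
            segment_lengths[fields[1]] = 0 if seq == "*" else len(seq)
        elif fields[0] == "L" and len(fields) >= 5:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            n_links += 1
            degree[fields[1]] += 1
            degree[fields[3]] += 1

    n_segments = len(segment_lengths)
    degrees = [degree[name] for name in segment_lengths]
    return {
        "segments": n_segments,
        "links": n_links,
        "total_bp": sum(segment_lengths.values()),
        "links_per_segment": n_links / n_segments if n_segments else 0.0,
        "max_degree": max(degrees, default=0),
    }


# A single "bubble": segment 1 branches to 2 or 3, which rejoin at 4.
BUBBLE = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tA",
    "S\t3\tG",
    "S\t4\tTTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
]
bubble_stats = summarize_gfa(BUBBLE)
```

To run it on a real file, pass an open file handle, for example summarize_gfa(open("small.gfa")), where small.gfa is a placeholder for your own file name, and compare the result for the two graphs.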