Indexing the human pangenome draft
9 months ago
lisle.mose ▴ 20

Hi,

I am attempting to create a VG index against the human pangenome draft using vg autoindex. Here is the command:

vg autoindex --gfa hprc-v1.0-mc-grch38-minaf.0.1.gfa --tmp-dir /home/ec2-user/pangenome/tmp

vg has been running for about a week now and I've seen the following in the logs 12 times so far:

[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
warning:[IndexRegistry] Child process 66427 failed with status 256 representing exit code 1
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.

Over 2TB of disk space and just under 1TB of RAM are available on the machine vg is running on.

The xg index appears to have built successfully. The .gcsa and .gcsa.lcp files are both of size zero bytes.

Ultimately, I'd like to map a small number of short sequences (as short as 20 nt) to the pangenome, and I am particularly interested in structural variants. The index distributed with the human pangenome draft appears to be built for giraffe, which does not appear to support sequences this short.

Any pointers on how to build the index more efficiently or other ways of mapping these short sequences would be appreciated.

Thanks!

Pangenome VG • 988 views

FYI, the process was eventually killed after about 10 days. There was nothing new in the logs, and the .gcsa and .gcsa.lcp files are still zero bytes.


Hi Lisle, I've replicated this behavior locally and will look into the cause.

9 months ago

I've figured out what's going on. As a hotfix, you can remove all of the "W" lines from the GFA like this:

grep -v "^W" hprc-v1.0-mc-grch38-minaf.0.1.gfa > hprc-v1.0-mc-grch38-minaf.0.1.no_w.gfa

That GFA should be indexable. Sometime soon, I'll update the indexing logic so that it works the same way without you having to remove the W lines.
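If you want to sanity-check the filtered file before starting another long indexing run, something like the following might help. It is only a sketch using standard grep, awk, and sort; the function names strip_w_lines and gfa_record_counts are made up for illustration and are not part of vg. It assumes the GFA is tab-separated with the record type (H, S, L, P, W, ...) in the first field.

```shell
#!/usr/bin/env bash

# Hypothetical helper: write a copy of a GFA with every W (walk) line removed.
# This is the same filter as the grep command above, wrapped in a function.
strip_w_lines() {
    grep -v '^W' "$1" > "$2"
}

# Hypothetical helper: count GFA records by type, i.e. by the first
# tab-separated field. Prints one "TYPE<TAB>COUNT" line per record type.
gfa_record_counts() {
    awk -F'\t' '{ counts[$1]++ } END { for (t in counts) print t "\t" counts[t] }' "$1" | sort
}
```

After running strip_w_lines on the original GFA, calling gfa_record_counts on both files should show the same counts for every record type except W, which should be absent from the filtered file.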


Thanks! That worked.
