Question

Substantial reduction in read mapping following VG surject

1

Entering edit mode

9 months ago

peterdfields ▴ 20

I did a small comparison of total mapping of a small short-read dataset to:

a graph constructed using cactus-minigraph composed of eight whole genome alignments using VG giraffe
the same cactus-minigraph graph, but indexed using VG and mapped to with VG map
a single linear genome mapped to using BWA-mem

The total mapping to the graph (as assessed using VG stats) using both VG giraffe and VG map showed nearly 100% mapping success. Additionally, mapping success to the linear reference using BWA-mem showed nearly 100% success based on SAMTOOLS flagstat. However, when I used VG surject to generate a BAM file from the GAM created using VG giraffe/map (specifying the same reference as used in the BWA-mem test), I see a substantial reduction in successful mapping (from nearly 100% to ~80%). I'm not entirely sure if this is expected behavior, and my general experience of constructing graphs based upon single reference genomes + variant calls plus mapping and surjecting hasn't typically exhibited this sort of mapping reduction. Any advice to better understand why this reduction might be happening would be greatly appreciated.

giraffe cactus-minigraph vg • 857 views

ADD COMMENT • link 8 months ago by peterdfields ▴ 20

0

Entering edit mode

Yes, that's not what we've generally seen either. One thing that would cause that problem is if there's a large quantity of unaligned sequence in the MGC graph. In that case, the read alignments might not overlap any reference position, in which case vg surject doesn't know where to put them. You could check this roughly by comparing the length of sequence in the GFA (by, e.g., grep "^S" graph.gfa | cut -f 3 | wc -c) to the length of the reference genome.

Another question I have is whether you are using the full, clipped, or depth2 graphs that MGC produces. Assuming you are working with human data, a known issue in the full graph is that MGC really has no effective way to align centromere sequences, so they tend to be left as "parallel tracks", which would be mostly unsurjectable. Centromere alpha satellites comprise about 10% of the human genome, so it could partially account for the reduction, but there would still be 10% difference left.

In any case, the current best practice for mapping is to use one of the filtered graphs rather than the full graph. That's how we've gotten the best results, although people are actively working on smarter filtering schemes that could replace the naive methods MGC currently uses.

ADD REPLY • link 9 months ago by Jordan M Eizenga ▴ 460

0

Entering edit mode

Jordan M Eizenga Thank you for your reply, my apologies for not replying sooner. In my case I was using the depth2 graph created by MGC. I believe the issue you brought up concerning repetitive sequences was what was causing the reduction.

ADD REPLY • link 8 months ago by peterdfields ▴ 20