Hello,
I have a question regarding the methodology for comparing the number of SNPs called using Giraffe on a pangenome graph with the number called using BWA-MEM2 on a linear reference.
I have seen two different methods in publications.
The first converts the alignments from .gam to .bam using vg surject, then runs a regular variant calling pipeline against the linear reference that was used as the backbone to construct the pangenome graph. I saw this used in several papers, like here or here.
I also saw a second method here, where the authors ran vg augment on the alignments, followed by vg pack, vg snarls and finally vg call.
Is there a particular method that you would recommend for doing that?
I wish you a nice day, Regards, Marion
Hi,
Thank you for your answer.
My issue has evolved a bit since then. I have tried the surject method, which led to a ~20% decrease in reads aligned in the resulting .bam file. Consequently, I get far fewer variants called than when I use a regular linear reference with the same downstream variant calling method (GATK in my case). I am now trying to see how I could improve that, and whether other methods for variant calling on a pangenome graph could be applied to divergent species.
The variant calling method you want does not exist yet.
If you are using short reads, the approach using vg surject works best with graphs without too large structural variants. Otherwise many reads will map to locations that are nowhere near the reference sequence. Those alignments cannot be projected to the reference, and vg surject will drop them.
vg call works best for genotyping variants already present in the graph. You can try using it to call novel variants with the vg augment approach, but that introduces a lot of noise from sequencing errors and unnormalized edits, and vg call does not handle the noise very well.
What you want is closer to genome inference than variant calling. You would need a variant caller that works directly with the pangenome graph. After calling variants relative to the graph, you would infer the most likely haplotype paths in the graph and then use the graph to get the alignment between those paths and the paths corresponding to the reference genome.
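If you want to quantify how many reads are lost in the surjected alignments compared with the linear-reference alignments, a minimal sketch along the following lines may help. It counts primary records whose SAM unmapped flag (0x4) is not set, and then reports the relative drop. The function names and the toy records are hypothetical and only for illustration; the sketch assumes you feed it SAM text lines, and it counts records rather than read pairs.

```python
def count_primary_mapped(sam_lines):
    """Return (mapped, total) counted over primary SAM records.

    Header lines (starting with '@') and blank lines are skipped.
    Secondary (0x100) and supplementary (0x800) records are ignored
    so that each read end is counted once.
    """
    mapped = 0
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x900:  # secondary or supplementary alignment
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


def relative_drop(mapped_linear, mapped_surjected):
    """Fraction of mapped reads lost relative to the linear-reference run."""
    if mapped_linear == 0:
        raise ValueError("mapped_linear must be greater than zero")
    return (mapped_linear - mapped_surjected) / mapped_linear


# Toy records (hypothetical data): one header line, two mapped primary
# reads, one unmapped read, and one supplementary record that is skipped.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r3\t16\tchr1\t200\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t2064\tchr1\t900\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
]

if __name__ == "__main__":
    mapped, total = count_primary_mapped(toy_sam)
    print(f"mapped primary records: {mapped} of {total}")
    print(f"relative drop for 100 vs 80 mapped reads: {relative_drop(100, 80):.2f}")
```

In practice the lines could come from the output of samtools view -h on each .bam file, run once for the BWA-MEM2 alignments and once for the surjected Giraffe alignments.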