I would like to perform a gene presence-absence analysis, i.e. build a matrix indicating which genes are present or absent in each of a set of genomes from individuals of the same species. Many people would call this a pan-genome analysis, and I'd like to know whether vg (and variation graphs in general) is a suitable tool.
I am new to vg, so I would like some help with the recommended steps for performing this type of analysis.
What I have is a set of 10 high-quality genome assemblies (FASTA), along with corresponding gene annotations (GFF3). All genomes are from the same plant species, so they are quite similar, but they are still expected to include structural variation and gene content diversity.
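To make the input concrete, here is a minimal sketch of how I currently pull gene coordinates out of each GFF3 file. It assumes standard nine-column, tab-separated GFF3 with `gene` in the type column and an `ID=` attribute; the function name `parse_gff3_genes`, the source tag `maker`, and the example records are made up for illustration and are not from my real data.

```python
from typing import Iterable, List, NamedTuple
from urllib.parse import unquote


class Gene(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive
    strand: str
    gene_id: str


def parse_gff3_genes(lines: Iterable[str]) -> List[Gene]:
    """Collect features whose type column is 'gene' from GFF3 lines."""
    genes = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated tag=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        genes.append(
            Gene(cols[0], int(cols[3]), int(cols[4]), cols[6],
                 unquote(attrs.get("ID", "")))
        )
    return genes


# Hypothetical records, only to show the expected shape of the input.
example = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t1000\t2500\t.\t+\t.\tID=geneA;Name=foo",
    "chr1\tmaker\tmRNA\t1000\t2500\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "chr2\tmaker\tgene\t300\t900\t.\t-\t.\tID=geneB",
]
genes = parse_gff3_genes(example)
for g in genes:
    print(g.gene_id, g.seqid, g.start, g.end, g.strand, sep="\t")
```

These are per-genome linear coordinates; what I don't know is how (or whether) to translate them onto the graph.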
I think the first step would be to construct the graph. What would be the best approach?
Option 1: Map each genome to the reference (using e.g. minimap2/MUMmer) --> call variants --> combine VCFs --> construct graph from reference FASTA + VCF
Option 2: Perform whole-genome MSA (Cactus?) --> construct graph from MSA
Once I have the graph, how should I infer gene presence-absence? Can I use the genome assemblies, or should I use short reads in some way? How can I find the coordinates of genes on the graph (and should I)? And how can I use the graph to determine that two genes annotated on different genomes are actually the same gene, present in both genomes?
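To show the output I am after, here is a minimal sketch of the final matrix. It assumes I already have clusters of annotated genes that were judged to be "the same gene" across genomes. Producing those clusters is exactly the part I don't know how to do with the graph, so `clusters`, the genome names and the gene IDs below are all hypothetical, and nothing here uses vg itself.

```python
from typing import Dict, List, Set, Tuple

# A cluster member is (genome_name, gene_id_in_that_genome).
Member = Tuple[str, str]


def presence_absence_matrix(
    clusters: Dict[str, Set[Member]], genomes: List[str]
) -> Dict[str, List[int]]:
    """Return one row per gene cluster, one 0/1 column per genome."""
    matrix = {}
    for cluster_id, members in clusters.items():
        present_in = {genome for genome, _gene_id in members}
        matrix[cluster_id] = [1 if g in present_in else 0 for g in genomes]
    return matrix


# Hypothetical input: which annotated genes were grouped together.
genomes = ["genome01", "genome02", "genome03"]
clusters = {
    "cluster1": {("genome01", "g1"), ("genome02", "g7"), ("genome03", "g4")},
    "cluster2": {("genome01", "g2"), ("genome03", "g9")},
    "cluster3": {("genome02", "g8")},
}

matrix = presence_absence_matrix(clusters, genomes)

# Print as a tab-separated table with a header row.
print("cluster", *genomes, sep="\t")
for cluster_id in sorted(matrix):
    print(cluster_id, *matrix[cluster_id], sep="\t")
```

In this toy example cluster1 would be a core gene, while cluster2 and cluster3 would be dispensable genes present in only some genomes.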
I hope you can provide some directions or suggest a pipeline for performing the analysis. Thanks!