vg for gene presence-absence analysis
2.3 years ago
liorglic ★ 1.4k

I would like to perform a gene presence-absence analysis to create a matrix indicating the presence/absence of genes in a set of genomes from individuals of the same species. This is what many people would call a pan-genome analysis, and I'd like to see if vg (and variation graphs in general) are a suitable tool.
I am new to vg, so would like some help regarding the recommended steps for performing this type of analysis.
What I have is a set of 10 high quality genome assemblies (fasta), along with corresponding gene annotations (gff3). All genomes are from the same plant species, so are quite similar, but still expected to include structural variation and gene content diversity.
I think the first step would be to construct the graph. What would be the best approach?
Option 1: Map each genome to the reference (using e.g. Minimap2/Mummer) --> call variants --> combine VCFs --> construct graph from reference fasta + VCF
Option 2: perform whole genome MSA (Cactus?) --> construct graph from MSA

Once I have the graph, how should I infer gene presence-absence? Can I use the genome assemblies, or should I use short reads in some way? How can I find the coordinates of genes on the graph (and should I)? And how can I use the graph to determine that two genes annotated on different genomes are actually the same gene, present in both genomes?

I hope you can provide some directions or suggest a pipeline for performing the analysis. Thanks!

vg pan-genome vgteam PAV graph variation • 1.1k views
2.3 years ago

In general, inferring paralogy vs orthology is a hard problem. If you already have gene annotations on a reference genome, you might have some luck using the Comparative Annotation Toolkit (CAT), which can use whole-genome alignments from Cactus. Ultimately, I think it makes more sense to think about finding the genes in the assemblies rather than "in the graph", but CAT can use the multi-way whole-genome alignment (which is similar to a pangenome graph) to guide paralogy/orthology inference.
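To make the end product concrete: once every annotated gene has been assigned to an orthology group by whatever method you settle on, assembling the presence/absence matrix itself is the easy part. Below is a minimal Python sketch under that assumption. The input mapping, the function names `build_pav_matrix` and `core_families`, and the `OG…` identifiers are all hypothetical and chosen for illustration; they are not CAT's output format or any tool's API.

```python
from typing import Dict, List, Set, Tuple


def build_pav_matrix(
    genes_by_genome: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a presence/absence matrix (rows = gene families, columns = genomes).

    genes_by_genome is a hypothetical input: genome name -> set of
    orthology-group IDs found in that genome's annotation.
    """
    genomes = sorted(genes_by_genome)
    # Every family seen in at least one genome gets a row.
    families = sorted(set().union(*genes_by_genome.values()))
    matrix = [
        [1 if family in genes_by_genome[genome] else 0 for genome in genomes]
        for family in families
    ]
    return families, genomes, matrix


def core_families(families: List[str], matrix: List[List[int]]) -> List[str]:
    """Return the families that are present in every genome."""
    return [family for family, row in zip(families, matrix) if all(row)]


# Toy data for illustration only (assumed, not real results).
example = {
    "genomeA": {"OG1", "OG2"},
    "genomeB": {"OG1", "OG3"},
    "genomeC": {"OG1", "OG2", "OG3"},
}

families, genomes, matrix = build_pav_matrix(example)
print("family\t" + "\t".join(genomes))
for family, row in zip(families, matrix):
    print(family + "\t" + "\t".join(str(value) for value in row))
```

The hard part remains the step before this one: deciding which annotated genes across assemblies belong to the same group.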


Thanks for the interesting reply. I am not familiar with CAT and will look into it. Still, I find it a bit strange that current "state of the art" pan-genome projects actually perform two pan-genome construction procedures: one using a variation graph, and another focused on gene presence/absence. There might be a way to combine these two approaches, but AFAIK this has not been done yet.

