I want to infer the core and accessory genomes for ~300 fungal strains of the same species. The draft assemblies vary in quality. One goal is to predict how many core and accessory genes are likely missing from each draft genome, based on how many conserved genes are missing (using results from a tool such as BUSCO or CEGMA). I would then like to report both an empirical result based on the data as-is and an expanded dataset that simulates how the result would have looked had all the genomes been complete.
My understanding is that core and accessory genes commonly evolve at significantly different rates. I am therefore not sure whether BUSCO results are a valid surrogate for extrapolating to the accessory genome (the core genome may be OK, I suppose?).
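For the core genome, the naive correction I have in mind looks roughly like the sketch below. It rests on one strong assumption, which is exactly what I am unsure about: that genes drop out of a draft assembly at the same rate as BUSCO markers do. The function names, the per-genome completeness values, and the independence between genes are all hypothetical choices for illustration, not an established method.

```python
import random


def estimate_missing_genes(observed_genes, busco_completeness):
    """Naive estimate of how many genes a draft assembly is missing.

    ASSUMPTION: genes are lost at the same rate as BUSCO markers,
    i.e. observed = true_total * busco_completeness.
    """
    if not 0 < busco_completeness <= 1:
        raise ValueError("busco_completeness must be in (0, 1]")
    expected_total = observed_genes / busco_completeness
    return expected_total - observed_genes


def simulate_observed_core(n_true_core, completeness_per_genome, seed=0):
    """Count true core genes that are still detected in every draft.

    ASSUMPTION: each gene is detected in genome i independently with
    probability completeness_per_genome[i].
    """
    rng = random.Random(seed)
    surviving = 0
    for _ in range(n_true_core):
        # A gene stays in the observed strict core only if no draft lost it.
        if all(rng.random() < p for p in completeness_per_genome):
            surviving += 1
    return surviving
```

With illustrative numbers, a draft with 9,500 predicted genes and 95% BUSCO completeness would be estimated to lack about 500 genes. The second function shows why this matters for a strict core definition: across ~300 drafts, even small per-genome losses compound, so the observed core can be much smaller than the true one.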
With that as context, here are my questions:
1. How can I simulate the expected number of core genes, given that nearly all of my draft genomes do not contain all expected BUSCO genes (but to different degrees)?
2. How should I change this simulation for accessory genes, given that they evolve differently from core genes?
3. I looked through the literature but was unable to find a directly relevant paper. Is there any prior published work on this?