Hi! Hoping someone here has advice.
I have as input: First, 20ish high quality chromosome level assemblies (e.g.: reference genomes from ncbi) from a single genus (plus two outgroup sequences) with a relatively large genome. Second, a newick phylogeny (created with single copy BUSCO genes).
I want as output: First and primarily, a vcf of snps (on regions where it's possible). Second, only if possible, an ancestral sequence reconstruction fasta from one of the internal nodes on the tree, also chromosome level if possible (but I think it's not). Also, ideally the vcf uses this ancestral sequence as a reference, but ok if not.
I don't care about structural rearrangements at all and it doesn't have to be the best possible method out there. I realize extracting snps from assemblies will not include het sites, which is fine. I also realize I won't get every genomic region, that's fine too.
I've been using cactus but it's driving me nuts. I turned the output hal into a maf, and when I tried to extract the ancestral sequence, it gave me a fasta with 40,000 blocks. Now I somewhat understand why this happened but I don't see an obvious solution, and I think cactus is overkill anyway. Anyway I have been considering using hal2vg then vg2vcf (vg deconstruct) then trying to pull out my reconstruction from there. And then also hal2maf then maf2vcf. But would appreciate thoughts on that.
I've done this kind of analysis on mitochondrial and chloroplast genomes before, which has been relatively straightforward. For that, I would do multiple sequence alignment (e.g. with muscle), then extract a vcf (e.g. with snp-sites), then reconstruct a phylogeny (e.g. with raxml or beast). Then I guess you could just write a little python script to naively but effectively reconstruct an internal node sequence - I haven't done this last step before but it seems feasible.
It seems MUCH harder to scale it up to the whole genome level. Any help appreciated, thanks. :)