I'm trying to follow this interesting yeast genome paper, which took newly sequenced assemblies, used the UCSC liftOver pipeline to line them up against a reference, and then created a pseudo-molecule for each chromosome of the primary yeast reference:
Each de novo assembly was scaffolded against the S. cerevisiae S288c reference genome assembly. The liftOver workflow (Kuhn et al., 2007) was used to determine the coordinates of contigs from each newly assembled strain relative to the reference strain (http://genomewiki.ucsc.edu/index.php/Minimal_Steps_For_LiftOver). Scaffolded contigs mapped to each reference strain chromosome were combined into a “pseudo-molecule,” with the placed contigs stitched together with gaps indicated by “N.” Unplaced contigs (including alternative lower scoring matches) were kept. Unplaced contigs less than 300 nucleotides were not included in the final assembly (Table S1).
I'm not convinced that creating a chain file (which lists individual regions in the new assembly and where they map to the reference) is the best way to scaffold contigs. The best specification of the chain file format that I've found, at genome.ucsc.edu, shows that the mapping isn't 1-to-1: there isn't just one set of "coordinates of contigs from each newly assembled strain relative to the reference strain", because a chain is by definition a series of aligned blocks separated by gaps on either side, and one contig can appear in several chains. Am I overthinking it? As a process, I imagine something like:
For each position in the reference, use the chain file to find the best-matching contig and build the sequence from it. Then move to the reference position where that contig ends, find the gap before the next matched sequence, fill it with N's, and repeat. But what if several regions in the new assembly match the same part of the reference? Or what if the placed contigs overlap?
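To make the idea concrete, here is a minimal Python sketch of that process. It is my own guess at the logic, not the paper's method, and it rests on several assumptions: each chain has already been reduced to one `Placement` (contig name, reference start and end, chain score), contigs are already in reference orientation, and all names here (`Placement`, `choose_placements`, `stitch`, `min_gap`) are hypothetical. It handles the two problem cases crudely: a contig with several matches keeps only its highest-scoring one, a contig lying entirely inside another's reference span is left unplaced, and partially overlapping neighbours are separated by a fixed run of N's rather than trimmed.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    contig: str     # contig name in the new assembly
    ref_start: int  # 0-based start on the reference chromosome
    ref_end: int    # end (exclusive) on the reference chromosome
    score: int      # chain score; higher is treated as better


def choose_placements(placements):
    """Keep the best-scoring placement per contig, then drop contained ones."""
    best = {}
    for p in placements:
        if p.contig not in best or p.score > best[p.contig].score:
            best[p.contig] = p
    # Sort by start; for equal starts, put the longer span first.
    ordered = sorted(best.values(), key=lambda p: (p.ref_start, -p.ref_end))
    kept = []
    for p in ordered:
        if kept and p.ref_end <= kept[-1].ref_end:
            continue  # fully inside the previous span: leave it unplaced
        kept.append(p)
    return kept


def stitch(placements, contig_seqs, min_gap=100):
    """Join placed contigs in reference order, padding gaps with N's.

    The gap length is the distance between reference spans; when the
    spans overlap or abut, min_gap N's are inserted instead (an
    arbitrary choice made for this sketch).
    """
    kept = choose_placements(placements)
    parts = []
    prev_end = None
    for p in kept:
        if prev_end is not None:
            gap = p.ref_start - prev_end
            parts.append("N" * max(gap, min_gap))
        parts.append(contig_seqs[p.contig])
        prev_end = p.ref_end
    return "".join(parts), [p.contig for p in kept]


if __name__ == "__main__":
    demo = [
        Placement("c1", 0, 10, 100),
        Placement("c1", 50, 60, 40),  # lower-scoring alternative match
        Placement("c2", 15, 25, 90),
    ]
    seq, order = stitch(demo, {"c1": "AAAA", "c2": "CCCC"}, min_gap=3)
    print(order, seq)
```

Even this toy version shows where the ambiguity lies: the result depends entirely on how "best" is defined and on what is done with overlaps, and the paper's description does not specify either.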
Does anyone know of an algorithm or tool that already does this, or have experience doing it?