1. Do people generally treat a scaffold-level reference genome as if each scaffold were a chromosome?
2. Should a scaffold-level reference genome be treated differently from a chromosome-level one?
3. What are the main challenges of using a reference genome that is only at the scaffold level?
Basically, I often read in population genetics textbooks that we have to study "chromosomes". But I have a hard time imagining how the theory applies differently when only scaffolds are available.
All bioinformatic applications, including mapping (to a FASTA), will proceed as usual. For annotation, you'll also be fine as long as the contig/scaffold names in your reference match those in the annotation file you're using. This isn't a major problem: the human reference, for example, has a number of unlocalized scaffolds and patch scaffolds that are relevant for annotation but aren't in the set of more thoroughly characterized chromosomes. Most applications treat contigs, scaffolds, and chromosomes the same way.
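The name-matching condition is easy to check before running anything heavy. Here is a minimal sketch in Python, assuming a plain FASTA and a tab-delimited GFF/GTF; the function names and the toy records are made up for illustration and are not from any particular tool.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The name is the first whitespace-delimited token after '>'
                names.add(fields[0])
    return names


def gff_names(lines):
    """Collect sequence names from column 1 of a GFF/GTF file."""
    names = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # GFF3 may embed sequences after this directive
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        names.add(line.split("\t")[0])
    return names


def missing_from_reference(fasta_lines, gff_lines):
    """Names used by the annotation that are absent from the FASTA."""
    return sorted(gff_names(gff_lines) - fasta_names(fasta_lines))


# Toy records (hypothetical): the second annotation line uses a different
# spelling of the scaffold name than the FASTA does.
fasta = [">scaffold_1 length=8", "ACGTACGT", ">scaffold_2", "GGCC"]
gff = [
    "##gff-version 3",
    "scaffold_1\tsrc\tgene\t1\t8\t.\t+\t.\tID=g1",
    "Scaffold2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2",
]
print(missing_from_reference(fasta, gff))
```

With real files you would pass open file handles instead of the toy lists. Any name this reports is a feature that downstream tools will not be able to place on the reference.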
So, to answer your questions:
1. More-or-less.
2. Not really, but be aware that your analysis may be affected if some scaffolds can't be localized to true chromosomes (e.g., are two scaffolds in LD because they're physically adjacent in reality?).
3. Outlined above, but they're generally captured under issues resulting from 'assembly uncertainty' sensu lato.
Thanks for the answer! I guess one trick would be to use LiftOver to match the scaffolds to the chromosomes of a closely related species (such as the more detailed reference genome of the zebra finch). But LiftOver probably comes with its own challenges.
You could also attempt to localize scaffolds to closely related species with more sophisticated references using BLAT, if LiftOver (or CrossMap or your favorite alternative) won't work because there are no good whole-genome alignments from which to create a chain file. This will work best if the scaffolds are short; otherwise you may need to attempt a whole-genome alignment anyway. From a (very) quick peek, UCSC doesn't seem to have Geospiza as a source species.
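Once you have BLAT output, summarizing it is straightforward. The sketch below assumes BLAT's default tab-delimited PSL output with scaffolds as the query and the related species' chromosomes as the target, and assigns each scaffold to the chromosome that collects the most matching bases. The helper `psl_row`, the scaffold name, and the chromosome names are hypothetical and only exist to build toy input.

```python
from collections import defaultdict


def best_chromosome_per_scaffold(psl_lines):
    """Map each query scaffold to (best target, share of matched bases)."""
    matched = defaultdict(lambda: defaultdict(int))
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # PSL rows have 21 columns; header lines do not start with a number
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])   # column 1: number of matching bases
        q_name = fields[9]         # column 10: query (scaffold) name
        t_name = fields[13]        # column 14: target (chromosome) name
        matched[q_name][t_name] += matches
    result = {}
    for scaffold, per_target in matched.items():
        total = sum(per_target.values())
        target, best = max(per_target.items(), key=lambda kv: kv[1])
        result[scaffold] = (target, best / total if total else 0.0)
    return result


def psl_row(matches, q_name, t_name):
    """Build a toy 21-column PSL row (hypothetical helper for the demo)."""
    fields = [matches, 0, 0, 0, 0, 0, 0, 0, "+", q_name, 1000, 0, matches,
              t_name, 100000, 0, matches, 1, f"{matches},", "0,", "0,"]
    return "\t".join(str(f) for f in fields)


toy_psl = [
    "psLayout version 3",
    psl_row(900, "scaffold_7", "chr1"),
    psl_row(50, "scaffold_7", "chr1"),
    psl_row(50, "scaffold_7", "chr5"),
]
for scaffold, (target, share) in best_chromosome_per_scaffold(toy_psl).items():
    print(scaffold, target, round(share, 2))
```

This simply sums matches over all hits, so overlapping hits are counted more than once. A scaffold whose top share is low is either chimeric, repetitive, or rearranged relative to the other species, and is worth a closer look.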
A quick search suggests that these two species have a median estimated divergence time of 30 million years, so you should expect quite a few differences.
Thanks!! Cool! When you look here https://genome.ucsc.edu/cgi-bin/hgLiftOver, you have to look for "medium ground finch". I don't know why they kept the common English name, but that's how they recorded it...
May I ask how you got the divergence estimate so quickly, and roughly what divergence time would still make it worthwhile to try mapping the scaffolds onto chromosomes?
Thanks! Didn't see that.
This site provides quick estimates aggregated across a few studies. It generally works as a decent starting place if you don't have the data or computational power to do a full dating analysis, though you can always compare with your favorite paper.
What will map well is a function of the divergence of your sample relative to the reference. You can change the mapping parameters from their defaults to let more mismatches through, but doing so increases the uncertainty in your results. You can get a sense of what will be tolerated if you have estimates of substitution rates for the class of loci you're looking at (e.g., intron, exon, UTR, etc.). This can be pretty tricky in practice, but you could try mapping your reads to the exome of a better-annotated species and see what comes out.
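As a back-of-the-envelope version of that reasoning: if μ is the substitution rate per site per year and T the divergence time, the expected number of substitutions per site between the two species is K = 2μT, and under the Jukes-Cantor model the fraction of sites that actually look different is p = 3/4 (1 − e^(−4K/3)). The sketch below applies this per locus class. The rates in it are placeholder values chosen for illustration, not estimates for finches, and Jukes-Cantor is a simplifying assumption.

```python
import math


def expected_mismatch_fraction(rate_per_site_per_year, divergence_years):
    """Expected fraction of differing sites under Jukes-Cantor."""
    # Both lineages accumulate substitutions, hence the factor of 2
    k = 2.0 * rate_per_site_per_year * divergence_years
    return 0.75 * (1.0 - math.exp(-4.0 * k / 3.0))


def expected_mismatches_per_read(rate_per_site_per_year, divergence_years,
                                 read_length):
    """Expected number of mismatching bases in a read of the given length."""
    p = expected_mismatch_fraction(rate_per_site_per_year, divergence_years)
    return p * read_length


# Placeholder rates (hypothetical, substitutions per site per year)
rates = {"exon": 1.0e-9, "intron": 2.5e-9}
divergence_years = 30e6   # the 30 million year figure quoted above
read_length = 100         # hypothetical read length in bp

for locus_class, rate in rates.items():
    n = expected_mismatches_per_read(rate, divergence_years, read_length)
    print(f"{locus_class}: about {n:.1f} expected mismatches per read")
```

Comparing these numbers with the mismatch allowance of your mapper gives a rough idea of which classes of loci are likely to map at all, before you start loosening parameters.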
For larger sequences with greater divergence, BLAT will be your friend. If your scaffolds are really large, you'll want to look into genome alignments to infer homology.
Amazing! Again, thanks a lot for your very nice answer! I appreciate that you took the time to explain and went further to give me a new intuition for how to approach the problem!!