Question: Is it a problem that the reference genome is not at the chromosome level?
11 months ago, beausoleilmo (McGill University) wrote:

I'm studying a species where there is a reference genome that is assembled only at the scaffold level ("unplaced scaffolds"). See here

My questions are:

  • Do people generally treat a scaffold-level reference genome as if each scaffold were a chromosome?
  • Should scaffold-level and chromosome-level reference genomes be treated differently?
  • What are the main challenges of using a reference genome that is only at the scaffold level?

Basically, I often read in population genetics textbooks that we have to study "chromosomes". But I have a hard time imagining how the theory applies differently when only scaffolds are available.

11 months ago, Brice Sarver (United States) wrote:

All bioinformatic applications, including mapping (to a FASTA), will proceed as usual. For annotation, you'll also be fine as long as the contig/scaffold names in the FASTA match those in the annotation file you're using. This isn't a major problem: the human reference, for example, has a number of unlocalized scaffolds and patch scaffolds that are relevant for annotation but aren't in the set of more thoroughly characterized chromosomes. Most applications treat contigs, scaffolds, and chromosomes the same way.
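A quick way to check the naming point is to compare the sequence names in the FASTA against column 1 of the annotation. The following is only a minimal sketch in Python: it assumes a plain-text FASTA and a tab-delimited GFF/GTF, and the function names and toy records are made up for illustration.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_names(handle):
    """Collect sequence names from column 1 of a tab-delimited GFF/GTF."""
    names = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # GFF3 may embed sequences after this directive
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        names.add(line.split("\t", 1)[0].strip())
    return names


def compare_names(fasta_handle, annotation_handle):
    """Return (names only in the annotation, names only in the FASTA)."""
    fa = fasta_names(fasta_handle)
    ann = annotation_names(annotation_handle)
    return sorted(ann - fa), sorted(fa - ann)


# Toy data (hypothetical): the second annotation record uses a mismatched name.
fasta = io.StringIO(">scaffold_1 len=8\nACGTACGT\n>scaffold_2\nGGCC\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "scaffold_1\tsrc\tgene\t1\t8\t.\t+\t.\tID=g1\n"
    "Scaffold2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2\n"
)
missing_in_fasta, unannotated = compare_names(fasta, gff)
print("In annotation but not in FASTA:", missing_in_fasta)
print("In FASTA but not in annotation:", unannotated)
```

Names that appear only in the annotation are the ones that will silently drop out of downstream steps, so those are worth fixing first. With real files, pass open file handles instead of the `io.StringIO` objects.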

So, to answer your questions:

  1. More-or-less.
  2. Not really, but realize that your analysis may be impacted if all the scaffolds can't be localized to true chromosomes (e.g., are two scaffolds in LD because they're next to each other in reality?).
  3. Outlined above, but they're generally captured under issues resulting from 'assembly uncertainty' sensu lato.

Thanks for the answer! I guess that one trick that could be used is LiftOver, to try to match the scaffolds to the chromosomes of a closely related species (like the more detailed reference genome of the zebra finch). But LiftOver probably comes with its own challenges.

written 11 months ago by beausoleilmo

You could also attempt to localize scaffolds against a closely related species with a more complete reference using BLAT, if LiftOver (or CrossMap or your favorite alternative) won't work because there are no good whole-genome alignments from which to create a chain file. This will work best if the scaffolds are short; otherwise you may need to attempt a whole-genome alignment anyway. A (very) quick peek at UCSC suggests that Geospiza isn't available as a source species.
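If you do go the BLAT route, the PSL output can be summarized per scaffold. Here is a rough sketch in Python, assuming standard 21-column PSL output (column 1 = matches, column 10 = query name, column 14 = target name). It simply sums matching bases per scaffold/chromosome pair and keeps the best target; that scoring is a simplification chosen for illustration, not an established placement method, and the function names and toy hits are hypothetical.

```python
import io
from collections import defaultdict


def best_target_per_query(psl_handle):
    """Map each query (scaffold) to the target with the most summed matches."""
    matched = defaultdict(int)  # (query name, target name) -> matching bases
    for line in psl_handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 21 or not cols[0].isdigit():
            continue  # skip the PSL header and malformed lines
        q_name, t_name = cols[9], cols[13]
        matched[(q_name, t_name)] += int(cols[0])

    best = {}
    for (q_name, t_name), n in matched.items():
        # On ties, the first target encountered is kept.
        if q_name not in best or n > best[q_name][1]:
            best[q_name] = (t_name, n)
    return best


def toy_psl_row(matches, q_name, t_name):
    """Build a 21-column PSL-like row; unused fields are filled with zeros."""
    cols = ["0"] * 21
    cols[0], cols[8], cols[9], cols[13] = str(matches), "+", q_name, t_name
    return "\t".join(cols)


# Toy hits (hypothetical): scaffold_1 hits chr1 twice and chr2 once.
psl = io.StringIO("\n".join([
    toy_psl_row(500, "scaffold_1", "chr1"),
    toy_psl_row(300, "scaffold_1", "chr1"),
    toy_psl_row(600, "scaffold_1", "chr2"),
    toy_psl_row(400, "scaffold_2", "chrZ"),
]) + "\n")
best = best_target_per_query(psl)
for scaffold, (target, n) in sorted(best.items()):
    print(scaffold, "->", target, f"({n} matching bases)")
```

A scaffold whose matches are split almost evenly between two targets is exactly the ambiguous case to flag rather than trust, so in practice you would also want to look at the runner-up.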

A quick search reveals that these two species have a median divergence of 30 million years, so you should expect quite a few differences.

written 11 months ago by Brice Sarver

Thanks!! Cool! When you look here, you have to look for "medium ground finch". I don't know why they kept the common English name, but that's how they recorded it...

May I ask how you got the divergence so quickly, and roughly what divergence time would still make it worthwhile to try to map the scaffolds onto chromosomes?

written 11 months ago by beausoleilmo

Thanks! Didn't see that.

This site provides quick estimates aggregated across a few studies. It generally works as a decent starting place if you don't have the data or computational power to do a full dating analysis, though you can always compare with your favorite paper.

What will map well is a function of the divergence of your sample relative to the reference. You can change mapping parameters from the defaults to let more mismatches through, but doing so increases the uncertainty in your results. You can get a sense of what will be tolerated if you have estimates of substitution rates for the class of loci you're looking at (e.g., intron, exon, UTR). This can be pretty tricky in practice, but you could try mapping your reads to the exome of a better-annotated species and see what comes out.
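As a back-of-the-envelope illustration of that reasoning, the sketch below turns a substitution rate and a divergence time into an expected number of mismatches per read. It rests on loud assumptions: a constant rate on both lineages, the Jukes-Cantor model to account for multiple hits at the same site, and substitutions only (no indels or sequencing error). The rate used here is a placeholder for illustration, not an estimate for finches or for any particular class of loci.

```python
import math


def expected_distance(rate_per_site_per_year, divergence_time_years):
    """Substitutions per site separating two lineages: d = 2 * rate * time."""
    return 2.0 * rate_per_site_per_year * divergence_time_years


def jc69_observed_fraction(d):
    """Expected fraction of differing sites under Jukes-Cantor for distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def expected_mismatches(read_length, rate_per_site_per_year, divergence_time_years):
    """Expected mismatching bases in a read of the given length."""
    d = expected_distance(rate_per_site_per_year, divergence_time_years)
    return read_length * jc69_observed_fraction(d)


# ASSUMPTION: placeholder rate, chosen only to make the arithmetic concrete.
rate = 2e-9          # substitutions per site per year (hypothetical)
time_years = 30e6    # divergence time quoted earlier in the thread
read_length = 100    # hypothetical read length in bases

d = expected_distance(rate, time_years)
p = jc69_observed_fraction(d)
print(f"expected distance d = {d:.3f} substitutions/site")
print(f"expected differing fraction p = {p:.3f}")
print(f"expected mismatches per {read_length} bp read = "
      f"{expected_mismatches(read_length, rate, time_years):.1f}")
```

Swapping in rates for different classes of loci (if you have them) gives a rough idea of which regions are likely to fall outside a mapper's default mismatch tolerance.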

For larger sequences with greater divergence, BLAT will be your friend. If your scaffolds are really large, you'll want to look into genome alignments to infer homology.

written 11 months ago by Brice Sarver

Amazing! Again, thanks a lot for your very nice answer! I appreciate that you explained this and went further to give me a new intuition for how to approach the problem!!

written 11 months ago by beausoleilmo