Question

Is it a problem that the reference genome is not at the chromosome level?

2

5.0 years ago

beausoleilmo ▴ 590

I'm studying a species whose reference genome is assembled only at the scaffold level ("unplaced scaffolds"). See here: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Geospiza_fortis/101/.

My questions are:

1. Do people generally treat a scaffold-level reference genome as if each scaffold were a chromosome?
2. Should a scaffold-level and a chromosome-level reference genome be treated differently?
3. What are the main challenges of using a reference genome that is only at the scaffold level?

Basically, I often read in population genetics textbooks that we have to study "chromosomes". But I have a hard time imagining how the theory applies differently when there are only scaffolds.

Chromosome reference genome scaffolds • 1.9k views

ADD COMMENT • link updated 5.0 years ago by Brice Sarver ★ 3.8k • written 5.0 years ago by beausoleilmo ▴ 590

score 2 · Accepted Answer · 2019-08-06

2

5.0 years ago

Brice Sarver ★ 3.8k

All bioinformatic applications, including mapping (to a FASTA), will proceed as usual. For annotation, you'll also be fine as long as the contig/scaffold names in the FASTA match those in the annotation file you're using. This isn't a major problem - the human reference, for example, has a number of unlocalized scaffolds and patch scaffolds that are relevant for annotation but aren't in the set of more thoroughly characterized chromosomes. Contigs, scaffolds, and chromosomes are treated the same by most applications.
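The name-matching point can be checked before running anything heavy. Below is a minimal sketch, assuming a plain FASTA and a tab-delimited GFF/GTF annotation; the function names and the toy records are made up for illustration, and real files would be opened with `open(...)` instead of `io.StringIO`.

```python
import io

def fasta_names(handle):
    """Collect sequence names: the first whitespace-delimited token of each '>' header."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gff_seqids(handle):
    """Collect seqids (column 1) from a GFF/GTF file, skipping comments and blank lines."""
    seqids = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # GFF3 may embed sequences after this directive
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids

def missing_from_fasta(fasta_handle, gff_handle):
    """Return annotation seqids that have no matching sequence name in the FASTA."""
    return sorted(gff_seqids(gff_handle) - fasta_names(fasta_handle))

# Toy data (hypothetical names), standing in for real files.
fasta = io.StringIO(">scaffold_1 some description\nACGT\n>scaffold_2\nGGCC\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "scaffold_1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "scaffold_3\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2\n"
)
missing = missing_from_fasta(fasta, gff)
print(missing)
```

An empty list means every annotated scaffold exists in the FASTA; anything listed usually points to a naming-scheme mismatch (e.g., accession vs. scaffold name) rather than missing sequence.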

So, to answer your questions:

1. More-or-less.
2. Not really, but realize that your analysis may be impacted if all the scaffolds can't be localized to true chromosomes (e.g., are two scaffolds in LD because they're next to each other in reality?).
3. Outlined above, but they're generally captured under issues resulting from 'assembly uncertainty' sensu lato.
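To make the LD point concrete, here is a rough sketch of one common approximation: the squared Pearson correlation between unphased genotype dosages (0/1/2 copies of the alternate allele) at two SNPs. The genotype vectors are invented toy data and the function name is hypothetical. High r² between SNPs sitting at the ends of two different scaffolds could hint that the scaffolds are physically close, but population structure and selection can produce the same signal.

```python
import math

def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2).

    This is an approximation to LD r^2 for unphased data, not a haplotype-based estimate.
    Returns NaN if either SNP is monomorphic in the sample.
    """
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return float("nan")
    return (cov * cov) / (var1 * var2)

# Toy dosages for 8 individuals (made up for illustration).
snp_end_of_scaffold_a = [0, 1, 2, 0, 1, 2, 0, 2]
snp_start_of_scaffold_b = [0, 1, 2, 0, 1, 2, 0, 2]   # identical pattern
snp_elsewhere = [2, 0, 1, 1, 0, 2, 1, 0]

r2_same = genotype_r2(snp_end_of_scaffold_a, snp_start_of_scaffold_b)
r2_other = genotype_r2(snp_end_of_scaffold_a, snp_elsewhere)
print(r2_same, r2_other)
```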

ADD COMMENT • link 5.0 years ago by Brice Sarver ★ 3.8k

0

Thanks for the answer! I guess that one trick that could be used is LiftOver to try to match the scaffolds to chromosomes of a closely related species (like the more detailed reference genome of the Zebra finch). But probably, LiftOver comes with its own challenges.

ADD REPLY • link 5.0 years ago by beausoleilmo ▴ 590

1

You could also attempt to localize scaffolds to a closely related species with a more sophisticated reference using BLAT, if LiftOver (or CrossMap or your favorite alternative) won't work because there are no good whole-genome alignments from which to create a chain file. This will work best if the scaffolds are short; otherwise you may need to attempt a whole-genome alignment anyway. A (very) quick peek at UCSC suggests that Geospiza isn't available as a source species.
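If you go the BLAT route, the output still has to be turned into a scaffold-to-chromosome assignment. Here is a minimal sketch, assuming standard 21-column PSL output: it sums matching bases per scaffold/chromosome pair and keeps the best target. The function names and toy rows are hypothetical, and a real analysis would also want to look at coordinates, strand, and how close the runner-up is.

```python
import io
from collections import defaultdict

def best_chromosome_per_scaffold(psl_handle):
    """Map each query (scaffold) to the target with the most matching bases.

    PSL columns used (0-based): 0 = matches, 9 = qName, 13 = tName.
    Header lines are skipped because their first field is not an integer.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in psl_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        totals[fields[9]][fields[13]] += int(fields[0])
    best = {}
    for scaffold, per_target in totals.items():
        target, matched = max(per_target.items(), key=lambda item: item[1])
        best[scaffold] = (target, matched)
    return best

def toy_psl_row(matches, q_name, t_name):
    """Build a syntactically valid 21-column PSL row with placeholder coordinates."""
    m = str(matches)
    return "\t".join(
        [m] + ["0"] * 7 + ["+", q_name, "1000", "0", m,
                           t_name, "50000", "0", m, "1", m + ",", "0,", "0,"]
    )

psl = io.StringIO("\n".join([
    toy_psl_row(400, "scaffold_1", "chr1"),
    toy_psl_row(300, "scaffold_1", "chr1"),
    toy_psl_row(500, "scaffold_1", "chr2"),
    toy_psl_row(250, "scaffold_2", "chrZ"),
]) + "\n")

best = best_chromosome_per_scaffold(psl)
print(best)
```

Summing over hits rather than taking the single best hit is a design choice: a long scaffold usually aligns in many blocks, so one strong hit to a repeat on the wrong chromosome shouldn't decide the placement.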

A quick search reveals that these two species have a median estimated divergence of 30 million years, so you should expect quite a few differences.

ADD REPLY • link 5.0 years ago by Brice Sarver ★ 3.8k

0

Thanks!! Cool! When you look here https://genome.ucsc.edu/cgi-bin/hgLiftOver, you have to look for "medium ground finch". I don't know why they kept the common English name, but that's how they recorded it...

May I ask how you found the divergence time so quickly, and what divergence time would be "roughly" reasonable for trying to map the scaffolds onto chromosomes?

ADD REPLY • link 5.0 years ago by beausoleilmo ▴ 590

1

Thanks! Didn't see that.

This site provides quick estimates aggregated across a few studies. It generally works as a decent starting place if you don't have the data or computational power to do a full dating analysis, though you can always compare with your favorite paper.

What will map well is a function of the divergence of your sample relative to the reference. You can change mapping parameters from the defaults to let more mismatches through, but you're increasing the uncertainty in your results. You can get a sense of what will be tolerated if you have estimates of substitution rates for the class of loci you're looking at (e.g., intron, exon, UTR, etc.). This can be pretty tricky in practice, but you could try mapping your reads to the exome of a better-annotated species and see what comes out.
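As a back-of-the-envelope version of that reasoning: with a divergence time t and a per-lineage substitution rate mu, the expected pairwise distance is roughly d = 2 * mu * t (both lineages accumulate changes), and a Jukes-Cantor correction converts that into the proportion of sites that actually look different. The sketch below uses the 30 million years mentioned above; the substitution rate and read length are placeholder assumptions, not estimates for these finches, so plug in values for your own class of loci.

```python
import math

def expected_pairwise_distance(rate_per_site_per_year, divergence_years):
    """Expected substitutions per site between two lineages: d = 2 * mu * t."""
    return 2.0 * rate_per_site_per_year * divergence_years

def observed_difference_jc69(distance):
    """Proportion of differing sites under Jukes-Cantor: p = 3/4 * (1 - exp(-4d/3))."""
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))

divergence_years = 30e6   # from the thread: ~30 million years
assumed_rate = 2e-9       # ASSUMPTION: placeholder substitutions/site/year
read_length = 150         # ASSUMPTION: placeholder read length in bp

d = expected_pairwise_distance(assumed_rate, divergence_years)
p = observed_difference_jc69(d)
mismatches_per_read = p * read_length

print(f"expected distance d = {d:.3f} substitutions/site")
print(f"expected differing sites p = {p:.3f}")
print(f"expected mismatches per {read_length} bp read = {mismatches_per_read:.1f}")
```

Comparing the expected mismatches per read with what your mapper tolerates by default gives a quick feel for whether cross-species mapping is plausible for neutral sites; conserved exons will do better than this average and introns or intergenic sequence may do worse.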

For larger sequences with greater divergence, BLAT will be your friend. If your scaffolds are really large, you'll want to look into genome alignments to infer homology.

ADD REPLY • link 5.0 years ago by Brice Sarver ★ 3.8k

0

Amazing! Again, thanks a lot for your very nice answer! I appreciate that you explained this and went further to give me a new intuition for how to approach the problem!!

ADD REPLY • link 5.0 years ago by beausoleilmo ▴ 590