N.B., many people do de novo assembly precisely because they have no reference sequence (or only a poor-quality one), so aligning to a reference isn't strictly a must.
You might want to read the papers for an aligner such as bwa or bowtie, since they discuss the mismatch issue. There are many ways to find imperfect alignments, but a common method is to search for a seed (a contiguous stretch of the read) and then extend the alignment from there. You then look for the candidate alignment with the highest score, which often depends on: the edit distance between the read and the reference, the phred quality of the mismatching bases, the presence of Ns in that region of the reference, and a set of mismatch penalties and match bonuses. Alignments are then given a MAPQ score reflecting how likely they are to be correct (the supplemental material in the original MAQ paper has a really nice section on how this can be done).
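To make the seed-and-extend idea concrete, here is a toy sketch in Python. It is not how bwa or bowtie actually work: it assumes ungapped alignments, takes only the first few bases of the read as the seed, and uses made-up scoring values. The function names and parameters (`seed_and_extend`, `seed_len`, `match`, `mismatch`, `mapq_from_error_prob`) are hypothetical. The second helper only shows the phred scaling that MAPQ uses (-10 * log10 of the probability that the alignment is wrong); estimating that probability is the hard part and is not attempted here.

```python
import math


def seed_and_extend(read, reference, seed_len=8, match=1, mismatch=-2):
    """Toy ungapped seed-and-extend.

    Returns (best_score, reference_start) or None if the seed is never found.
    Assumption: the seed is simply the first `seed_len` bases of the read.
    """
    seed = read[:seed_len]
    best = None
    start = reference.find(seed)
    while start != -1:
        # "Extend" the seed hit by comparing the whole read at this position.
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            score = sum(match if a == b else mismatch
                        for a, b in zip(read, window))
            if best is None or score > best[0]:
                best = (score, start)
        # Look for the next exact occurrence of the seed.
        start = reference.find(seed, start + 1)
    return best


def mapq_from_error_prob(p_wrong):
    """Phred-scale a probability that the alignment is wrong."""
    if p_wrong <= 0:
        raise ValueError("p_wrong must be positive")
    return round(-10 * math.log10(p_wrong))
```

Real aligners use indexed seed lookup (rather than a linear scan), allow gaps, and weight mismatches by base quality, so treat this only as an illustration of the overall flow.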
As for knowing how closely the reference matches whatever you're looking at, which is what I would interpret "correct" to mean, there's no single answer to that. The easiest method is to just perform the alignment and see how well it works. If you get a high non-alignment rate, then maybe de novo assemble some of the unmapped reads and BLAST the resulting contigs. That'll give you an idea of whether they come from a region where the reference sequence matches your sample too poorly (or perhaps they end up matching some bacteria that just happen to be contaminating your samples...).
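If you want a quick number for the non-alignment rate, a rough sketch like the following works on uncompressed SAM text. It relies only on the SAM FLAG bits (0x4 for unmapped, 0x100 for secondary, 0x800 for supplementary); the function name `unmapped_fraction` is just something I made up for illustration, and in practice a tool like samtools will report the same thing for you.

```python
def unmapped_fraction(sam_lines):
    """Fraction of primary SAM records that have the unmapped flag (0x4) set."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary (0x100) and supplementary (0x800) records so that
        # each read is counted only once.
        if flag & 0x900:
            continue
        total += 1
        if flag & 0x4:
            unmapped += 1
    return unmapped / total if total else 0.0
```

You would call it with an open file handle, e.g. `unmapped_fraction(open("aln.sam"))`, where `aln.sam` is a placeholder file name.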
Actually building a reference genome is a pretty involved process if the result is going to be of good quality. A variety of metrics are computed along the way, and they need to be looked at in their totality to get an idea of how well things went.
modified 5.2 years ago
5.2 years ago by
Devon Ryan ♦ 91k