Question: Reference Sequence Accuracy
goodcow (United States) wrote 5.2 years ago:

In modern sequencing, reads are aligned with a reference sequence.

What happens if a read differs from the reference genome by a SNP or sequencing error or something? How do we choose where to map it to?

More fundamentally, how can we be certain that the reference sequence is correct?

sequencing • 1.0k views

How would you define "correct"? :-)

Reply written 5.2 years ago by Pierre Lindenbaum

Reflecting what the true sequence is

Reply written 5.2 years ago by goodcow

That raises the question, "The true sequence of what?" For example, since there's no single universal human genome, the reference is just an abstract approximation of something that should generally match (in fact, it's a composite of multiple individuals).

Reply written 5.2 years ago by Devon Ryan

Good point, I hadn't thought about that. I suppose that, as multiple genomes of the same species are sequenced, errors in highly conserved regions (which should be identical across individuals) could be filtered out by consensus; similarly, areas of variation could be identified.

Reply written 5.2 years ago by goodcow

Devon Ryan (Freiburg, Germany) wrote 5.2 years ago:

N.B., many people do de novo assembly because they have no reference sequence (or only a poor-quality one), so aligning to a reference isn't really a must.

You might want to read any of the papers describing an aligner, such as BWA or Bowtie, since they discuss the mismatch issue. There are many ways to find imperfect alignments, but searching for a seed (a contiguous subset of the read) and then extending it is a common method. You then look for the alignment with the highest score, which often depends on: the edit distance between the read and the reference, the Phred quality of mismatching bases, the presence of Ns in a given region of the reference, and a set of mismatch penalties and match bonuses. Alignments are then given a MAPQ (mapping quality) score reflecting how likely they are to be correct (the supplemental material in the original MAQ paper has a really nice section on how this can be done).
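To make the seed-and-extend idea concrete, here is a toy sketch in Python. It is not how BWA, Bowtie, or MAQ are implemented: real aligners index the reference, try several seeds per read, and allow gaps. The function names, the `seed_len` parameter, and the cap of 60 are all illustrative assumptions. The sketch takes the first few bases of the read as the seed, extends each hit without gaps, penalises each mismatch by its Phred quality (treating reference Ns as neutral), and then derives a MAQ-style mapping quality from how much better the best placement is than the alternatives.

```python
import math

def seed_hits(read, ref, seed_len):
    """Start positions in ref where the read's first seed_len bases match exactly."""
    seed = read[:seed_len]
    hits = []
    pos = ref.find(seed)
    while pos != -1:
        if pos + len(read) <= len(ref):  # the whole read must fit on the reference
            hits.append(pos)
        pos = ref.find(seed, pos + 1)
    return hits

def mismatch_penalty(read, quals, ref, pos):
    """Ungapped extension: sum the Phred qualities of mismatching bases.
    An N in the reference counts as neither a match nor a mismatch."""
    total = 0
    for i, base in enumerate(read):
        ref_base = ref[pos + i]
        if ref_base != "N" and ref_base != base:
            total += quals[i]
    return total

def align(read, quals, ref, seed_len=4):
    """Return (best_position, mapq), or None if the seed is not found."""
    cands = sorted((mismatch_penalty(read, quals, ref, p), p)
                   for p in seed_hits(read, ref, seed_len))
    if not cands:
        return None
    _, best_pos = cands[0]
    # Simplified likelihood of each placement: product of the error
    # probabilities of its mismatching bases, i.e. 10^(-penalty/10).
    likes = [10 ** (-pen / 10) for pen, _ in cands]
    p_wrong = 1 - likes[0] / sum(likes)
    # Phred-scale the chance that the best placement is wrong; cap at 60 (assumed).
    mapq = 60 if p_wrong <= 1e-6 else min(60, round(-10 * math.log10(p_wrong)))
    return best_pos, mapq
```

With a reference that contains one exact copy of the read and one copy carrying a single mismatch at a Q30 base, the exact copy wins with a mapping quality of about 30. If the read matches two places equally well, the mapping quality drops to about 3, which is the signal that the placement is essentially a coin flip.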

Regarding how closely the reference matches whatever you're looking at, which is how I would interpret "correct", there's no single answer. The easiest method is to just perform the alignment and see how well it works. If you get a high rate of unaligned reads, then maybe de novo assemble some of the unmapped reads and BLAST the resulting contigs. That'll give you an idea of whether they come from a region where the reference matches your sample too poorly (or perhaps they end up mapping to some bacteria that just happen to be contaminating your samples...).
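As a rough way to check the non-alignment rate, the sketch below counts unmapped reads in SAM-format text. It relies only on the SAM FLAG field (bit 0x4 marks an unmapped read; bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped so each read is counted once). The function name is made up for illustration, and for paired-end data each mate is counted separately.

```python
UNMAPPED = 0x4          # SAM flag bit: read is unmapped
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment

def unmapped_fraction(sam_lines):
    """Fraction of primary SAM records whose 'unmapped' flag bit is set."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read only once
        total += 1
        if flag & UNMAPPED:
            unmapped += 1
    return unmapped / total if total else 0.0
```

Any iterable of lines works, so an open file handle on an uncompressed SAM file can be passed directly. What counts as a "high" fraction depends on the organism, the library, and the aligner settings, so treat the number as a prompt to look at the unmapped reads rather than as a verdict on the reference.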

Actually building a reference genome is a pretty involved process if the result is going to be of good quality, and a variety of metrics are computed along the way that need to be looked at in their totality to get an idea of how well things went.
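As one example of such a metric (my choice for illustration, not one singled out above), assemblies are commonly summarised by their contig N50: the contig length such that contigs of at least that length contain at least half of the assembled bases. A minimal sketch, with a made-up function name:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no contigs given
```

A large N50 only says the assembly is contiguous, not that it is correct, which is why it has to be read alongside other checks rather than on its own.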

Ashutosh Pandey wrote 5.2 years ago:

Another somewhat related question I asked a few years back:

Error In Reference Genome
