Reference Bias In Alignments
4
3
10.8 years ago

I have read several articles referring to "reference bias in mapping reads to a reference". How is "reference bias" defined? What is it conceptually? Could someone share an example with potential consequences in variant calling/RNA-seq analyses?

thanks!

mapping • 5.8k views
4
10.8 years ago
Fwip ▴ 490

If you're mapping reads to a reference, the result is going to resemble the reference fairly closely, or the mapper wouldn't be able to do its job. If you used a different reference, the output would resemble that one instead. Conceptually, that's where the "bias" comes in.

2
10.8 years ago
lh3 33k

I do not think reference bias has a clear definition. In my definition, it denotes the effect that reads carrying the reference allele are mapped better than reads carrying an alternate allele. This causes many artifacts. For example, at heterozygous sites, the reference allele gets more support. It has a major impact on allele-specific expression analysis with RNA-seq.
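To make the heterozygous-site effect concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions: the "mapper" is a hypothetical one that accepts a read if it has at most `max_mismatches` mismatches to the reference, sequencing errors are independent, and every parameter value is invented for illustration. Real aligners use scoring schemes and mapping qualities rather than a hard cutoff.

```python
import random

def simulate_het_site(n_per_allele=5000, read_len=50, error_rate=0.01,
                      max_mismatches=2, seed=0):
    """Count reads from each allele that a toy mapper would accept.

    Assumption: a read "maps" if it has <= max_mismatches mismatches
    against the reference. Alt-allele reads start with one mismatch
    (the SNP itself); sequencing errors add more.
    """
    rng = random.Random(seed)
    mapped = {"ref": 0, "alt": 0}
    for allele in ("ref", "alt"):
        for _ in range(n_per_allele):
            # Sequencing errors at the read positions other than the SNP.
            errors = sum(rng.random() < error_rate for _ in range(read_len - 1))
            mismatches = errors + (1 if allele == "alt" else 0)
            if mismatches <= max_mismatches:
                mapped[allele] += 1
    return mapped

counts = simulate_het_site()
alt_fraction = counts["alt"] / (counts["ref"] + counts["alt"])
print(counts, round(alt_fraction, 3))
```

Both alleles are sampled equally, yet the alt allele ends up with fewer mapped reads, so the observed allele ratio at a true heterozygote drifts below 0.5.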

0

I've observed the same effect in a deep-coverage pooled sequencing experiment of mitochondrial DNA. When I attempt to estimate the MAF from the pooled data, my estimate is systematically lower than expected because reads that contain the alternate allele have a lower probability of aligning.
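The size of that underestimate can be sketched with simple arithmetic. This assumes (hypothetically) that you know the probability that a reference-allele read maps, `p_map_ref`, and the probability that an alternate-allele read maps, `p_map_alt`; the numbers in the example are made up, not measured.

```python
def observed_alt_fraction(true_alt_freq, p_map_ref, p_map_alt):
    """Expected alt-allele fraction among mapped reads.

    Assumption: reads are sampled in proportion to the true allele
    frequency, and each read maps independently with a probability
    that depends only on which allele it carries.
    """
    mapped_alt = true_alt_freq * p_map_alt
    mapped_ref = (1.0 - true_alt_freq) * p_map_ref
    return mapped_alt / (mapped_alt + mapped_ref)

# Hypothetical example: true MAF 0.20, ref reads map 99% of the time,
# alt reads only 90% of the time.
print(observed_alt_fraction(0.20, 0.99, 0.90))
```

Whenever `p_map_alt` is smaller than `p_map_ref`, the value returned is below the true frequency, which matches the systematic underestimate described above.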

1
10.8 years ago

A specific example: you have measured RNA abundance with RNA-seq in two strains of mice. The mouse reference is the C57BL/6 strain. The first strain in your experiment is a close relative of C57BL/6, while the second is a wild-derived strain that is evolutionarily distant from C57BL/6. The close-relative genome is very similar to C57BL/6, while the wild-derived genome is much more divergent. When you align RNA-seq reads from these strains against the reference, reads from the wild-derived strain will carry more polymorphisms relative to the reference than reads from the close relative. These polymorphisms may affect your alignment success rates in a way that biases your read counts against the more distantly related strain.
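A rough back-of-the-envelope version of this, as a sketch only: assume each base of a read differs from the reference independently with probability `divergence`, and that a hypothetical aligner rejects any read with more than `max_mismatches` differences. The divergence values below are invented for illustration and are not estimates for any real mouse strain.

```python
from math import comb

def fraction_mappable(divergence, read_len=50, max_mismatches=2):
    """Binomial probability that a read has <= max_mismatches differences."""
    return sum(
        comb(read_len, k) * divergence**k * (1 - divergence)**(read_len - k)
        for k in range(max_mismatches + 1)
    )

def apparent_fold_change(div_close, div_distant, read_len=50, max_mismatches=2):
    """Distant/close ratio of mapped reads when true expression is equal."""
    return (fraction_mappable(div_distant, read_len, max_mismatches)
            / fraction_mappable(div_close, read_len, max_mismatches))

# Hypothetical divergences: 0.1% for the close relative, 2% for the
# wild-derived strain.
print(apparent_fold_change(0.001, 0.02))
```

A ratio below 1 means the divergent strain appears to have lower expression even though the true abundance is identical in both strains.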

1
10.8 years ago

This table (http://www.nature.com/nrg/journal/v10/n4/box/nrg2554_BX1.html) and its caption might shed some light:

There are fewer novel single nucleotide variants in J. Craig Venter's genome owing to the fact that his genome was partially represented in the Celera human genome assembly2 and variants in that assembly were subsequently mined and deposited into dbSNP117.

You can also imagine that in a few cases, probably very few, short reads from some of Craig Venter's loci would align to the reference whereas reads from another individual might not align at all (too many mismatches).

I feel the need to take a shower after writing this.

