Question: Can Variants Be Reliably Detected In Duplicated Gene/Region Sequence?
7.7 years ago by
University of Manchester, UK
Ian (5.7k) wrote:

I am currently working on a project to detect variants in related yeast strains, which is simple enough, at least for variant calling :).

However, the PI is interested in genes that have been duplicated, e.g. ribosome genes. This means that uniquely mapping reads to the genome results in zero coverage over the genes/regions of interest.

Has anyone done something similar to this?

What would the best mapping strategy be to include duplicated sequence, but also be suitable for variant detection?

Part of me wonders whether this is even possible. The worst I can think of is that reads will be split between the duplicates, but a mismatch could also lead to a misplaced read...

I am currently using bowtie with --best -k1 (other settings at their defaults) to obtain the reads, followed by samtools-based variant detection.

BTW the reads are colour-space reads from a SOLiD4.


variant-calling • 2.6k views
modified 7.7 years ago by Giovanni M Dall'Olio (27k) • written 7.7 years ago by Ian (5.7k)
7.7 years ago by
London, UK
Giovanni M Dall'Olio (27k) wrote:

I think that this is very difficult. Most SNP-detection methods do not work correctly in duplicated regions; in fact, a best practice is to remove all reads that map to multiple regions of the genome before doing SNP calling. Such reads are likely to come from copy number variations, so it is better to remove them, as any SNP called from them is likely to be a false positive.

Quoted from the 1000 Genomes paper:

"We restricted most variant calling to the ‘accessible genome’, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads".

The regions that have "unexpectedly high or low numbers of aligned reads" are likely to be duplications and deletions. So, in 1000 Genomes, they excluded the regions with ambiguously placed reads before doing the calling. I am sure that if you take any other article presenting novel SNP data (HapMap, etc.), you will be able to find a similar sentence in the Methods. Hopefully that will be enough to convince your PI :-)
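As a rough illustration of the "unexpectedly high or low" idea, here is a minimal Python sketch that keeps only the windows whose read depth is close to the median. The function name `accessible_windows`, the 0.5x and 2x cut-offs, and the depth values are all made up for illustration; they are not the criteria used by 1000 Genomes.

```python
from statistics import median

def accessible_windows(depths, low=0.5, high=2.0):
    """Return indices of windows whose depth is within [low, high] x the median depth.

    The cut-offs are arbitrary illustrative values, not published thresholds.
    """
    typical = median(depths)
    return [i for i, d in enumerate(depths)
            if low * typical <= d <= high * typical]

# Hypothetical per-window read depths: window 3 looks like a collapsed
# duplication (about double depth), window 5 looks like a deletion.
window_depths = [30, 28, 31, 62, 29, 3, 30]
print(accessible_windows(window_depths))  # [0, 1, 2, 4, 6]
```

Variants would then be called only inside the windows that pass, which is the spirit of restricting the analysis to an "accessible" part of the genome.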

For example, let's imagine that the following sequence got duplicated in the genome:


The resulting genome will look like:


After a number of generations, the nucleotide in the 5th position of the second duplicated region gets mutated:


If you call SNPs without taking into account that this sequence may be duplicated, you will believe that there is a single copy of the sequence, containing a SNP in the 5th position with a frequency of 50%. But this would be a false positive: there is no SNP, only a duplication whose two copies differ at that position.
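The same effect can be reproduced with a toy Python sketch. The 10-base sequence, the A-to-G change, and the equal read count per copy are assumptions chosen only to illustrate the point; `allele_fractions` is a hypothetical helper, not part of any variant caller.

```python
from collections import Counter

# Made-up sequence standing in for the duplicated region.
unit = "ACGTACGTAC"
copy1 = unit
# Second copy after mutation: 5th base (1-based) changed from A to G.
copy2 = unit[:4] + "G" + unit[5:]

def allele_fractions(copies, position, reads_per_copy=20):
    """Pile reads from every genomic copy onto one reference copy and
    return the fraction of each base seen at a 1-based position.

    Assumes each copy contributes the same number of error-free reads.
    """
    counts = Counter()
    for seq in copies:
        counts[seq[position - 1]] += reads_per_copy
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

print(allele_fractions([copy1, copy2], position=5))  # {'A': 0.5, 'G': 0.5}
```

A caller that sees this 50/50 pileup on a single reference copy cannot tell it apart from a heterozygous SNP, even though every read is a faithful copy of one of the two paralogous regions.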

modified 14 months ago by _r_am (31k) • written 7.7 years ago by Giovanni M Dall'Olio (27k)