Question: Primer design for variants in duplicated genes?
2.9 years ago by
emyli10 wrote:

Hi there,

I have whole genome sequencing data that I am using to look for novel variants. I have filtered the data and have a short list of potential novel variants, so now I want to validate by PCR amplification whether they are truly present in my DNA sample or are some sort of sequencing artefact. However, while trying to design gene-specific primers for a number of these variants, I am finding that the primer pairs are amplifying multiple genomic regions of identical size. When I align these regions and the surrounding DNA sequence, they are almost identical, i.e. it seems the variants are located in duplicated genes. This could of course be why these variants are coming up in my WGS analysis in the first place: the reads may not have aligned properly. Has anyone encountered a similar issue? Is there a way to validate these variants? Any advice would be much appreciated!

modified 2.3 years ago by Biostar ♦♦ 20 • written 2.9 years ago by emyli10

In order to help you, please provide some details: how did you align the data, what organism, which variant caller, did you apply a MAPQ threshold, give an example of a region that you cannot get clear bands from (coordinates), how did you make the primers (did you BLAST them)?

written 2.9 years ago by ATpoint38k

Try using Primer-BLAST to generate primers if you have not already.

How large are the repetitive regions, and what does the distribution of variants look like in the region? You might be able to Sanger sequence it if you can find a unique flanking sequence.

written 2.9 years ago by Daniel E Cook240
2.9 years ago by
Kevin Blighe65k wrote:

The issues you face stem from the fact that the majority of the genome exhibits sequence similarity with other regions in the genome. Much of this is indeed related to gene duplication events, with the duplicated copies either acquiring new functionality over time due to mutations or degenerating into pseudogenes. As a rough idea, there are up to 50,000 identified pseudogenes (who knows exactly), which can be divided into:

  • processed pseudogenes: the pseudogene consists of the transcribed mRNA of the original gene
  • unprocessed pseudogenes: the pseudogene consists of the genomic sequence of the original gene

This translates into issues with exome-seq because the probes (baits) used for sequence pull-down are not designed with this sequence similarity in mind, and the same alignment ambiguity affects WGS reads. Thus, when you align the data, you can set thresholds such as minimum read length and mapping quality (MAPQ) to be high, but then you'll see very low coverage over regions of high sequence similarity. On the other hand, if you relax the thresholds, you run the risk of misalignment and of making false-positive or false-negative variant calls.
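To make the MAPQ trade-off concrete, here is a minimal Python sketch that counts mapped reads per reference at two thresholds. It assumes tab-separated SAM records (MAPQ is the fifth column, and FLAG bit 0x4 marks an unmapped read); the function name `coverage_by_mapq`, the toy records, and the reference names `geneA` and `paralogB` are made up for illustration and are not from any real dataset.

```python
def coverage_by_mapq(sam_lines, min_mapq):
    """Count mapped reads per reference name with MAPQ >= min_mapq."""
    counts = {}
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference name
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:          # read is unmapped
            continue
        if mapq >= min_mapq:
            counts[rname] = counts.get(rname, 0) + 1
    return counts


# Hypothetical toy records: r2 and r3 mimic reads placed ambiguously
# between a gene and its paralog (low MAPQ), r4 is unmapped.
toy_sam = [
    "@SQ\tSN:geneA\tLN:1000",
    "r1\t0\tgeneA\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tgeneA\t120\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tparalogB\t130\t0\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

relaxed = coverage_by_mapq(toy_sam, 0)   # keeps ambiguous reads
strict = coverage_by_mapq(toy_sam, 20)   # drops them, coverage falls
print(relaxed, strict)
```

With the relaxed threshold the ambiguous reads still contribute to coverage (and possibly to spurious calls); with the strict one they vanish and the duplicated region is left thinly covered.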

What to do? To validate findings, you need to ensure that you design primers that uniquely target the region surrounding the variant being studied. If you cannot find a unique region in close proximity, you'll have to think about doing:

  • long-range PCR
  • Sanger or Roche 454 sequencing (long reads...)
  • MLPA
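As a rough illustration of what "uniquely target" can mean in practice, the Python sketch below takes a pairwise alignment of the target region and its paralog, finds the columns where they differ, and proposes forward primers whose 3' base sits on such a column. The function names, the toy sequences, and the 5-base primer length are hypothetical choices for the example; it ignores melting temperature, secondary structure, and genome-wide specificity, so any candidate would still need checking with a tool such as Primer-BLAST.

```python
def discriminating_positions(target_aln, paralog_aln):
    """Return alignment columns where the target and paralog differ."""
    if len(target_aln) != len(paralog_aln):
        raise ValueError("aligned sequences must have equal length")
    return [i for i, (a, b) in enumerate(zip(target_aln, paralog_aln)) if a != b]


def candidate_forward_primers(target_aln, paralog_aln, length=20):
    """List (start_column, primer) pairs whose 3' base is paralog-specific.

    Only ungapped windows of the target are considered, so alignment
    columns and target bases line up within each reported primer.
    """
    candidates = []
    for end in discriminating_positions(target_aln, paralog_aln):
        start = end - length + 1
        if start < 0:
            continue  # not enough upstream sequence for a full primer
        window = target_aln[start:end + 1]
        if "-" in window:
            continue  # skip windows that span an alignment gap in the target
        candidates.append((start, window))
    return candidates


# Hypothetical toy alignment: the sequences differ only at column 8 (A vs T).
target = "ACGTTGCAAGCT"
paralog = "ACGTTGCATGCT"

diffs = discriminating_positions(target, paralog)
primers = candidate_forward_primers(target, paralog, length=5)
print(diffs, primers)
```

A reverse primer could be handled the same way on the reverse complement. If no discriminating column lies within reach of the variant, that is the situation where the longer-range options listed above come in.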

If you want assistance in designing the best possible primers, then please follow my standard operating procedure that I wrote back in 2012, with which I and colleagues have had success: Designing a single set of primers and probe for a genomic region of interest. In it, you will have to skip step 5.6 as it requires the use of Primer Express, but this is only needed in order to develop the probe that's used in addition to a primer pair in real-time PCR.


modified 2.9 years ago • written 2.9 years ago by Kevin Blighe65k