Checking primers against many target genes, with mismatch tolerance
3
1
4.6 years ago
rotem ▴ 10

I have a primer pair and I want to test it against a given set of many target genes. How can I do that while allowing some mismatches?

I tried using primer3 (and primer-blast), but it only found matches where the exact primer sequence is contained in the target sequence. Even if the target sequence contained an 'N', the primer sequence had to also contain an 'N' at the same location to be found by primer3.

To be clear, I don't want to test the primers against an online database, but against my own local database.

primers mismatch tolerance • 3.3k views
1

By "some mismatches", do you mean you allow 1 or 2 mismatches anywhere, or does it also matter where in the primer those mismatches are? If I'm not terribly mistaken, mismatches at the 5' end are better tolerated than mismatches at the 3' end. Do you have some programming experience?

0

You can try writing a script that first generates all possible primer sequences with one mismatch, then with two mismatches, and so on, and then aligns them against the reference. For the alignment step you can use PrimerMap with the list of all generated primer sequences.

http://www.bioinformatics.org/sms2/primer_map.html
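
The enumeration step described above could look roughly like the following. This is a minimal sketch in plain Python, not part of PrimerMap; the function name `mismatch_variants` is made up for illustration, and it assumes substitutions only (no insertions or deletions) over the alphabet A, C, G, T.

```python
from itertools import combinations, product

BASES = "ACGT"

def mismatch_variants(primer, max_mismatches):
    """Return the set of sequences within max_mismatches substitutions of primer."""
    primer = primer.upper()
    variants = {primer}
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(primer)), k):
            # At each chosen position, try every base except the original one.
            choices = [[b for b in BASES if b != primer[i]] for i in positions]
            for replacement in product(*choices):
                seq = list(primer)
                for i, base in zip(positions, replacement):
                    seq[i] = base
                variants.add("".join(seq))
    return variants

# Number of sequences to align for a 16-mer with up to two mismatches.
print(len(mismatch_variants("GGCTGTTGTTGGTGTT", 2)))
```

The list grows quickly: a 20-mer with up to two mismatches already yields 1,771 sequences, so for longer primers or more mismatches it is usually cheaper to slide a window along each target and count mismatches directly.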

3
4.6 years ago
Erik Wright ▴ 390

You can confirm specificity and sensitivity of primers using the R package DECIPHER. It uses a model of hybridization and elongation efficiency based on the location and type of mismatches. For example:

library(DECIPHER)
?AmplifyDNA # displays helpful information
dna <- readDNAStringSet("<<path to FASTA file>>")
primers <- c("GGCTGTTGTTGGTGTT", "TGTCATCAGAACACCAA") # forward and reverse
amplicons <- AmplifyDNA(primers, dna, annealingTemp=55, P=4e-7, maxProductSize=500, includePrimers=FALSE)
# for each amplicon, count its occurrences in every template sequence
lapply(amplicons, vcountPattern, dna)


I hope that helps!

0

Thanks Erik! This looks promising. After I installed DECIPHER and prepared all the sequences, AmplifyDNA crashed because the command hybrid-min (probably the core of the annealing algorithm) could not be found. I read that hybrid-min is part of OligoArrayAux, which might be required for AmplifyDNA to work. However, it's a C package and I'm on a Mac, so I might not be able to install it without also installing the Xcode developer tools, which I don't want to do.

Any advice on how to use DECIPHER successfully? Many thanks! Rotem

0

Hi Rotem, You will need to install OligoArrayAux as the documentation says. This only requires a C compiler such as gcc. Xcode is one option, and you can always remove the Xcode application after it installs the necessary tools (simply delete it).

0

Yes! It's working. I just noticed you're the author -- thanks so much for this contribution!

Can you help me understand how I can map the products back to the templates? Meaning, if I have 10 templates and I get 2 products, how do I know which templates got amplified? According to the documentation I should see that in "names", but all I ever see there is (1 x 2).

0

Hi Rotem,

I am glad it is working for you. The current output does not include which template amplified in the DNAStringSet. It simply gives the predicted amplification efficiency followed by the primer set (1 x 2 = 1st and 2nd) for each amplicon.

I might add your requested feature in a future version of DECIPHER. For now, you can simply map your amplicons back to the templates if you set the argument includePrimers = FALSE, so that each amplicon contains only template sequence. I have edited the answer to reflect this.

Erik
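
For readers working outside R, the same mapping idea (look up each amplicon as a substring of the templates) can be sketched in plain Python. This is an illustration only: the helper names and the toy sequences are invented, and checking the reverse strand as well is an added assumption, not something the DECIPHER answer does.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]

def templates_containing(amplicon, templates):
    """Return names of templates that contain the amplicon on either strand."""
    amplicon = amplicon.upper()
    rc = reverse_complement(amplicon)
    hits = []
    for name, seq in templates.items():
        seq = seq.upper()
        if amplicon in seq or rc in seq:
            hits.append(name)
    return hits

# Toy templates; geneC is the reverse complement of geneA.
templates = {
    "geneA": "TTGACCGTAGGCTAACGTTAGC",
    "geneB": "CCCCGGGGAAAATTTT",
    "geneC": "GCTAACGTTAGCCTACGGTCAA",
}
print(templates_containing("CGTAGGCTAACG", templates))
```

An exact substring lookup only works when the amplicon carries no primer-derived mismatches, which is why the amplicons need to be generated without the primer sequences.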

0

This makes sense. Thanks again!

1

"Overall amplification efficiency of the PCR product is then calculated as the geometric mean of the two (i.e., forward and reverse) primers' efficiencies."

Oh dear :/

0

The amplification efficiency model is published in:

ES Wright et al. (2013) "Exploiting Extension Bias in PCR to Improve Primer Specificity in Ensembles of Nearly Identical DNA Templates." Environmental Microbiology, doi:10.1111/1462-2920.12259.

Indeed, the efficiency of a primer set (forward and reverse primers) is the geometric mean of its constituent primers' efficiencies. You can run a simulation to verify this.
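
As a worked illustration of that formula only (the efficiency values below are invented, and this is not DECIPHER code), the geometric mean of two per-primer efficiencies can be computed like this:

```python
import math

def pair_efficiency(forward_eff, reverse_eff):
    """Geometric mean of the forward and reverse primer efficiencies."""
    return math.sqrt(forward_eff * reverse_eff)

# Made-up example efficiencies for three hypothetical primer pairs.
for fwd, rev in [(0.9, 0.9), (0.9, 0.4), (0.9, 0.0)]:
    print(fwd, rev, round(pair_efficiency(fwd, rev), 3))
```

Unlike an arithmetic mean, the geometric mean is pulled toward the weaker primer and drops to zero as soon as either primer has zero efficiency.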

2

I don't need to run a simulation - I've been doing PCR for over a decade, and have designed over 13 million primer pairs for commercial use. I've been using UNAfold for designing primers for over 6 years. I've been telling anybody who wanted to listen for the last 5 years that primer design is not a mapping problem, but people always do exactly what you just did - use long words like "geometric mean of its constituent primers" and then defer to a publication. I'm not trying to pick on you, but anyone who has actually done PCR knows that your reaction is only as specific as your worst primer, and stating otherwise because "simulation" makes all bioinformaticians look bad in the eyes of biologists, which is absolutely not what we want.

OK that's a pretty hefty claim. I'll have to find some data.

Can you explain why aaaaaaaaaaaaaaaaaaaaaaaa has a "PCR Efficiency" of 100%, while aaaaaaaaaaaaaaaaaaaaaaaaa (1 more a) has an efficiency of 0%?

I know, I know, mismatch cutoffs. But can you explain why ANYTHING has a binding efficiency of 100% here? I'll tell you why - it's because your program solves the wrong problem. It doesn't matter if a sequence appears 1 time or a million times, it counts for 1.

1

Hi John,

As you pointed out, there is a mismatch cutoff built into the program. The user can specify a different cutoff, but I gather that is not the point you are trying to make.

Have you tried your example primer with your example template in a PCR reaction at the conditions you specified? And, if so, what was its amplification efficiency?

In my experience the program does a decent job of predicting experimental results for real primer sets on real templates. If you find cases where it fails then I would be happy to take a look.

Granted, it is not perfect. If you have a better solution available then please post it.

Erik

1

I believe one of the points that John is trying to make (and I'm sure he'll correct me if I'm wrong) is that your model for amplification efficiency does not account for some factors (such as template complexity, primer complexity, insert size) that are known to impact PCR amplification in real-world samples.

In his example, I can virtually guarantee that using his first primer plus ANY second primer will generate an abundant 48bp product of 'aaaaaaaaaaaaaaaaaaaatttttttttttttttttttttttt', lesser amounts of out-of-phase (e.g., 25a24t) products, possibly some 96bp (and higher order) repeats, and none of the desired first primer plus second primer product. Similarly, his second primer would generate 25a25t (and related) products with an efficiency virtually identical to the first.

His statement regarding biologists' low opinions of computational simulations, while harsh, has some validity - if the simulation does not sufficiently model real-world conditions, then it does reflect poorly on the discipline and its practitioners.

1

I'll admit I was unnecessarily harsh - although it's no excuse, I've had the worst headache all day, coughing and coughing and coughing, so I'm in a poor mood. Erik's point about "why don't you make a better one?" is totally valid, and it's actually the reason why this topic makes me so upset - because I can't.

But for-profit companies do have better in-house software -- it's how all the capture-sequencing works. They didn't manually test all of the millions of primers needed to make that work. The algorithm used to design the primers just produces primers that work every time. And then I think about all the man-hours wasted trying different salt concentrations and thermocycler voodoo. It frustrates the hell out of me -- but it's not you or your software's fault at all. You're trying to do the right thing, and I'm just being unhelpful.

2
4.6 years ago
John 13k

You are trying to apply the solution for the read-mapping problem to the primer problem, and primers don't work like that for two reasons:

The first is, as WdC says, that primers tolerate mismatches at their 5' end much better than at their 3' end. Actually, 5' mismatches are close to irrelevant. Only mismatches in the last 3 bp or so at the 3' end really matter, because those are enough to disturb the polymerase.

The second is that the primer problem is one of specificity. A sequence like "AAAAAAAAAAAAAATAACAAA" might exist in the genome only once, and therefore a read containing it can be mapped and "solved" for the read-mapping problem. However, for a primer, this will not work at all, because there will be hundreds if not thousands of sites very similar to that sequence. Its complexity is low, and as a result it has to "compete" with all those similar sequences, even if it's unique in the genome. Mappers don't even know how many times a sequence in their index appears in the genome.
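
Both points can be turned into a crude screen: slide the primer along a target, tolerate a few mismatches overall, but reject any site with a mismatch in the last few 3' bases, and then look at how many candidate sites remain. The sketch below is a plain Python illustration under those assumptions; the function name, the default cutoffs, and the hard 3 bp clamp are choices made for the example, not an established model of annealing.

```python
def find_priming_sites(primer, template, max_mismatches=2, three_prime_clamp=3):
    """List (start, mismatches) for template windows the primer might anneal to.

    The primer is written 5'->3' and compared directly against the template
    string, so its last `three_prime_clamp` characters are its 3' end.
    """
    primer, template = primer.upper(), template.upper()
    n = len(primer)
    sites = []
    for start in range(len(template) - n + 1):
        window = template[start:start + n]
        mismatched = [i for i in range(n) if primer[i] != window[i]]
        if len(mismatched) > max_mismatches:
            continue
        # Reject sites with any mismatch inside the 3' clamp.
        if any(i >= n - three_prime_clamp for i in mismatched):
            continue
        sites.append((start, len(mismatched)))
    return sites

# A 5' mismatch is tolerated, a 3' mismatch is not.
print(find_priming_sites("ACGTACGT", "TTTCGTACGTTT"))
print(find_priming_sites("ACGTACGT", "TTACGTACGATT"))
```

For the reverse primer, the same function can be run on the reverse complement of the template. A primer that returns many sites across the target set is the low-complexity case described above, even if its exact sequence occurs only once.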

0
4.6 years ago
rawi • 0

I'm not a pro, but I once stumbled over some questions concerning in silico PCR. Would "ipcress" be of any help? It has an option to allow mismatches in primers. It can be installed with:

apt install exonerate
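
A rough sketch of how an ipcress run might be prepared is shown below. It assumes, from my reading of the ipcress man page, that the input file holds one experiment per line as five whitespace-separated fields (id, primer A, primer B, minimum and maximum product length) and that the options are named --input, --sequence and --mismatch; please check `ipcress --help` before relying on this. The file names, the experiment id, and the product length limits are placeholders, and the primers are the ones from the DECIPHER answer.

```python
def ipcress_experiment_line(exp_id, primer_a, primer_b, min_len, max_len):
    """Format one experiment line (assumed layout: id, primers, product size range)."""
    return f"{exp_id} {primer_a} {primer_b} {min_len} {max_len}"

def ipcress_command(experiment_file, fasta_file, mismatches):
    """Build the command line as a list of arguments (option names are assumptions)."""
    return ["ipcress", "--input", experiment_file,
            "--sequence", fasta_file,
            "--mismatch", str(mismatches)]

# Placeholder id, file names, and size limits; save the first line as primers.ipcress.
line = ipcress_experiment_line("pair1", "GGCTGTTGTTGGTGTT", "TGTCATCAGAACACCAA", 50, 500)
print(line)
print(" ".join(ipcress_command("primers.ipcress", "targets.fasta", 2)))
```

The sketch only builds the strings and does not run ipcress, so it can be tried without the tool installed.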