Seeking tool for identifying hybridization probes in gDNA -- more like Southern blotting than PCR/microarray
3
0
Entering edit mode
9.9 years ago
cwhooker • 0

I need an app, something like a primer design app, that will scan a sequence (genome) and identify sections that are likely to work as hybridization probes in an application we are developing. We think the probes should be:

  • around 200bp long
  • not less than 55% GC
  • devoid of runs of more than 3 T's
  • and a few more similar "rules"

but wherever you see a number there, I need to be able to input that as a variable, because I need to use the software together with my physical method to refine the algorithm. That is, vary the parameters to generate multiple probes, test the probes in my application, and based on their performance go back and re-tweak the parameters; lather, rinse, repeat.

I also need the probes to have minimal repetitive/low-complexity sequence and to be, in practical terms, unique to the target organism in question, but I can deal with the last two requirements using other software, e.g. BLAST, if necessary.

So I have two specific questions:

  1. Does this thing already exist, and if so how do I get my hands on it?
  2. If it doesn't already exist, does that sound like something you would be interested in building? It would be a paying contract, so depending on answers perhaps I should post separately under "jobs".
probe-design sequence-scanning-algorithm • 3.0k views
ADD COMMENT
1
Entering edit mode
9.9 years ago
Danielk ▴ 640

Primer3 can design hyb probes.

ADD COMMENT
0
Entering edit mode

Thanks, Daniel. I looked at Primer3, but unless I'm too dumb to figure out the relevant settings (possible!) it won't do the sequence-based rules like "no runs of >3 T's" or "no more than 20 bp between adjacent G's" that I need.

ADD REPLY
0
Entering edit mode

I ended up writing a little script that applies such filters after Primer3 designed six candidate pairs over each target. Your 20bp rule will probably need manual processing no matter the tool you use. As a command line program, Primer3 is the easiest to script batch runs for.

ADD REPLY
1
Entering edit mode
9.9 years ago
Neilfws 49k

Here's something to get you (or anyone else) started. It requires BioPerl. You supply the sequence file (fasta format), minimum GC %, k-mer length and maximum consecutive "T" as command-line arguments, e.g.

perl getkmers.pl --seqfile myfile.fa --max_t 3 --min_gc 55 --length 50 > outfile

The script prints out the k-mer, GC % and length of the maximum consecutive "T". You could rewrite the code so as to use the values supplied to filter before printing the output. Or you could print everything to a file and filter later, in a spreadsheet or whatever. Of course for large files (genomes), you probably do not want to write out every k-mer.

head outfile
GTAGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGT    60    3
TAGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTT    58    3
AGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTT    58    3
GGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTA    58    3
GGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTAT    56    3
GTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATG    56    3
TCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGT    54    3
CGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTT    54    3
GAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTTT    52    3
AGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTTTT    50    4

Here's the code (rough and ready without validation, checks, tests etc.).

#!/usr/bin/perl -w

use strict;
use Getopt::Long;
use Bio::SeqIO;
use Bio::Tools::SeqStats;

GetOptions("seqfile=s" => \my $seqfile, # fasta file
           "length=i"  => \my $length,  # kmer length
           "min_gc=i"  => \my $min_gc,  # >= this GC%
           "max_t=i"   => \my $max_t);  # max consecutive T

my $seqio = Bio::SeqIO->new(-file => $seqfile, -format => "fasta");
while(my $seq = $seqio->next_seq) {
    my $seqlen = $seq->length;
    for(my $i = 1; $i <= $seqlen - ($length-1); $i++) {
        my $kmer  = $seq->subseq($i, $i+($length-1));
        my $stats = Bio::Tools::SeqStats->new(-seq => Bio::Seq->new(-seq => $kmer));
        my $mers  = $stats->count_monomers(); 
        my $gc    = 100 * ($mers->{'G'} + $mers->{'C'}) / $length;
        my $t     = 0;
        while($kmer =~ /T{$max_t,}/g) {
            $t = length($&) if length($&) > $t; # longest consecutive run of T
        }
        print $kmer, "\t", $gc, "\t", $t, "\n";
    }
}
ADD COMMENT
0
Entering edit mode

Also ... if you have a license for BLAT it is relatively straight forward to implement a BLAT check for each sequence that passes the original filtering constraints... I wouldn't do it for every probe because that would end up being very slow!

ADD REPLY
0
Entering edit mode
9.9 years ago
pld 5.1k

http://www.ncbi.nlm.nih.gov/tools/primer-blast/

You can always check the GC content after you get your primers back.

ADD COMMENT
0
Entering edit mode

Thanks, Joe. As with Primer3 alone, adding the "BLAST for unique sequences" doesn't get me the parameterizable rules. I could try to get fancy with degenerate (mixed) bases but I can't see a way to include rules like "no more than X bp between A's" by doing that.

ADD REPLY

Login before adding your answer.

Traffic: 1935 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6