Hi,
I am investigating a specific type of recombination that involves microhomology (2-20 bp) between short direct repeats. My current method identifies all putative sites of recombination and uses bl2seq to compare the short repeats. This works fine, but BLAST has a minimum word size of 4 for DNA comparisons, so I can only identify putative events involving matches of 4 bp or longer. Are there any methods for doing string comparison to identify this type of microhomology (2 or 3 bp of consecutive matching bases in 20 bp sequences)? To simplify things, I am only interested in the ends of the sequences, and I know which ends should match.
Here is an example:
ATCTAGTACGGATCGTACGTT
GTTATCTGAGCGAAAGCTAA
This is a comparison I'll be doing thousands of times, so I'd rather not construct an index or database for each comparison. My pipeline is written in Perl, so if a pure-Perl solution makes more sense, I'm open to that.
Thanks.
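For matches this short, a direct string comparison of the two ends may be enough, with no index or database needed. Below is a minimal sketch in Python; the same logic ports directly to Perl with substr. It assumes that "the ends that should match" means the 3' end of the first sequence and the 5' end of the second, which is how the example above lines up. The function name end_microhomology and the min_len/max_len parameters are hypothetical, chosen only for illustration.

```python
def end_microhomology(left, right, min_len=2, max_len=20):
    """Return the longest run of bases shared by the END of `left`
    and the START of `right`, or "" if none of at least min_len bp exists.

    Assumption: the junction is left's 3' end against right's 5' end.
    """
    left, right = left.upper(), right.upper()
    min_len = max(min_len, 1)  # a zero-length match is meaningless
    longest = min(max_len, len(left), len(right))
    # Try the longest candidate first so the first hit is the longest match.
    for k in range(longest, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


# The two sequences from the question.
print(end_microhomology("ATCTAGTACGGATCGTACGTT", "GTTATCTGAGCGAAAGCTAA"))
# prints: GTT
```

Each call does at most max_len short string comparisons, so running it thousands of times should be cheap. If the other pair of ends is the relevant one, swapping the arguments (or reverse-complementing first) would be the corresponding change.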
Just because I'm curious... If it is not "top secret", why do you want to do that? (Can you really infer homology from such a short alignment region?)
No secret. It is actually a well-known mechanism I am studying called illegitimate recombination that has been detailed in many species including yeast and plants. The very name of the mechanism describes the fact that it involves very short regions of homology and operates outside of the normal recombinational machinery (i.e., those involving RecA). I'm just trying to better understand the process and not introduce any bias into the analyses.
Note that the probability of two sequences having three nucleotides in common at their extremities is quite high. If you add sequencing errors on top of that, the matches you are getting may just be random events.
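To put a rough number on that, here is a small Python sketch of the chance expectation. It assumes independent bases with fixed frequencies (uniform by default), which real genomic sequence will not satisfy exactly, and it only covers a match of exactly k bases at one fixed pair of ends. The function name chance_end_match_probability is hypothetical.

```python
def chance_end_match_probability(k, base_freqs=None):
    """Probability that k bases at one fixed end of a sequence equal
    k bases at one fixed end of another, under an independence model.

    Assumption: bases are independent draws from base_freqs
    (uniform 0.25 each if not given).
    """
    if base_freqs is None:
        base_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # Probability that two independently drawn bases are identical.
    per_base = sum(f * f for f in base_freqs.values())
    return per_base ** k


for k in (2, 3, 4):
    print(k, chance_end_match_probability(k))
# prints:
# 2 0.0625
# 3 0.015625
# 4 0.00390625
```

Under this simple model a 3 bp end match turns up by chance in about 1 of every 64 comparisons, so the observed frequency of matches at known breakpoints can be compared against that baseline.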
The matches are not random if you are analyzing nonrandom sites in the genome, and as I said, I'm studying recombination events. If you are analyzing deletion events that are shared by multiple copies and you know the evolutionary history, or origin, of certain events, it is not random at all.