Here's a demo Python script you can modify for your use, which suggests the rough principle:
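#!/usr/bin/env python
import re
import sys

# Demo BED data: chromosome, start, stop, sequence (add further overlapping windows, one per line)
bed = """chr1\t0\t10\tABCDEFGHIJ"""
string_to_match = sys.argv[1]  # sys.argv[0] is the script name
pattern = re.compile(string_to_match)
for line in bed.splitlines():
    (chrom, start, stop, seq) = line.split("\t")
    for match in pattern.finditer(seq):
        # Offset the match by the window start to get genomic coordinates
        sys.stdout.write("\t".join([chrom, str(int(start) + match.start()), str(int(start) + match.end()), string_to_match]) + "\n")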
Some sample runs:
$ ./test.py HIJABC
chr1 7 13 HIJABC
$ ./test.py HIJAB
chr1 7 12 HIJAB
$ ./test.py BCDEF
chr1 1 6 BCDEF
$ ./test.py ABCD
chr1 0 4 ABCD
chr1 10 14 ABCD
chr1 16 20 ABCD
One would prepare a BED file from the genome's FASTA (with spanning windows), and then modify the Python script to read that BED file on a line-by-line basis.
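As a rough sketch of that workflow, the following generates overlapping windows from a sequence and then searches BED-formatted lines one at a time. The names `spanning_windows` and `search_bed_lines`, the window size of 10, and the step of 5 are assumptions chosen for illustration, not part of the original script. A real run would parse the FASTA first and pass an open BED file handle in place of the in-memory list.

```python
import re


def spanning_windows(chrom, sequence, size=10, step=5):
    """Yield (chrom, start, stop, subsequence) tuples for overlapping windows.

    Windows are at most `size` bases long and start every `step` bases,
    so neighbouring windows overlap by `size - step` bases (assumed values).
    """
    for start in range(0, len(sequence), step):
        stop = min(start + size, len(sequence))
        yield (chrom, start, stop, sequence[start:stop])
        if stop == len(sequence):
            break  # the last window already reaches the end of the sequence


def search_bed_lines(lines, string_to_match):
    """Yield (chrom, start, stop, string_to_match) for each match, line by line."""
    pattern = re.compile(string_to_match)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        chrom, start, stop, seq = line.split("\t")
        for match in pattern.finditer(seq):
            yield (chrom, int(start) + match.start(), int(start) + match.end(), string_to_match)


if __name__ == "__main__":
    # Stand-in for a BED file: any iterable of lines works, including a file handle.
    bed_lines = ["\t".join(map(str, w)) for w in spanning_windows("chr1", "ABCDEFGHIJ" * 2)]
    for hit in search_bed_lines(bed_lines, "HIJABC"):
        print("\t".join(map(str, hit)))  # prints: chr1  7  13  HIJABC (tab-separated)
```

Because `search_bed_lines` only iterates over its input, memory use stays flat even for a genome-sized BED file.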
If a lot of these searches are done, this is also easily parallelizable, splitting work by chromosome or some other unit of work that matches the environment.
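One possible way to split the work by chromosome is sketched below with the standard library's `concurrent.futures.ProcessPoolExecutor`. The function names (`group_by_chromosome`, `search_chromosome`, `parallel_search`) and the two-row demo data are hypothetical; on a cluster, one job per chromosome file would achieve the same thing.

```python
import re
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def group_by_chromosome(bed_lines):
    """Group BED lines by their first column (the chromosome name)."""
    groups = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        if line:
            groups[line.split("\t", 1)[0]].append(line)
    return dict(groups)


def search_chromosome(task):
    """Search one chromosome's windows; `task` is (lines, string_to_match)."""
    lines, string_to_match = task
    pattern = re.compile(string_to_match)
    hits = []
    for line in lines:
        chrom, start, stop, seq = line.split("\t")
        for match in pattern.finditer(seq):
            hits.append((chrom, int(start) + match.start(), int(start) + match.end(), string_to_match))
    return hits


def parallel_search(bed_lines, string_to_match, max_workers=None):
    """Run search_chromosome in a separate process per chromosome group."""
    tasks = [(lines, string_to_match) for lines in group_by_chromosome(bed_lines).values()]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return [hit for hits in pool.map(search_chromosome, tasks) for hit in hits]


if __name__ == "__main__":
    # Hypothetical demo rows; real input would come from the BED file.
    demo = ["chr1\t0\t10\tABCDEFGHIJ", "chr2\t0\t10\tJIHGFEDCBA"]
    for hit in parallel_search(demo, "BCD"):
        print("\t".join(map(str, hit)))
```

Processes rather than threads are used here because regular-expression matching is CPU-bound, so threads would mostly wait on each other in CPython.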
To avoid "double-hits", it might be worth piping the result through sort and then uniq (uniq on its own only collapses adjacent duplicate lines), or testing the length of the input string_to_match to ensure that it is at least half as long as the interleaving spanning elements.
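If the filtering is done inside the script instead of in the shell, a set of already-seen hits does the same job. This is a minimal sketch; `unique_hits` is a hypothetical helper, and it assumes each hit is a hashable tuple of (chromosome, start, stop, matched string).

```python
def unique_hits(hits):
    """Yield each hit once, in first-seen order, dropping later duplicates."""
    seen = set()
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            yield hit


if __name__ == "__main__":
    # The same match reported by two overlapping windows.
    hits = [("chr1", 0, 4, "ABCD"), ("chr1", 10, 14, "ABCD"), ("chr1", 10, 14, "ABCD")]
    for hit in unique_hits(hits):
        print("\t".join(map(str, hit)))
```

Unlike uniq, this does not need sorted input, at the cost of keeping every distinct hit in memory.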
There are perhaps libraries that do a better job of dealing with these and similar edge cases. Still, hopefully this is a useful starting point.