Given a sequence
my_seq = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of 10-bp reads
my_reads = substring(my_seq,1:(nchar(my_seq)-9),10:nchar(my_seq))
which I want to map to a
target sequence targetseq:
targetseq = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in
my_reads as a wildcard, so all
reads map to
I want to do this kind of string matching in a large database of strings, and I found the
Biostrings be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here
my_reads) into a dictionary of class
"pdict" using the function
PDidc(). You can use this 'dictionary' later in functions like
countPDict() to count matches against a target sequence.
In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, PDict() will complain about the read
15 if you use a trusted band of minimum width = 1 like:
> PDict(my_reads[1:15],tb.start=5,tb.end=5) Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, : non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band.
> PDict(my_reads[1:14],tb.start=5,tb.end=5) #is OK TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"): - with a head of width 4 - with a Trusted Band of width 1 - with a tail of width 5
Is there any way to circumvent this limitation imposed by
Biostrings? Or does anyone have a different idea? I'm almost giving up and using a following approach: generate reads and an index with the target sequence (saving both as fasta), use a mapper like GEM to map the reads to the index, then parse the output file to count the matches!
P.S.: I also have "N" in the target sequence sometimes, but I guess I can address that by setting the argument 'fixed' in countPDict and related functions to 'FALSE'.