I have thousands of DNA sequences that look like this:
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
I need to extract the subsequence between CTACG and CAGTC in each one. However, many of these sequences contain an error (a deletion, insertion, or substitution). Is there a way to allow for such mismatches based on Levenshtein distance? With exact matching, the third sequence below returns NA:
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#>
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#>
#> [[3]]
#> [1] NA
Created on 2021-12-18 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)
With approximate matching I could also extract the sequence in the last case, where the left flank reads CTCG instead of CTACG (a single deletion).
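One way this could be sketched in base R, assuming at most one edit per flank is tolerated: slide windows whose length is within one character of the motif over the sequence, score each window against the motif with `utils::adist()` (generalized Levenshtein distance), and keep the closest one. The helper names `best_hit()` and `extract_between_fuzzy()` and the `max_dist` parameter are hypothetical, and ties are broken by the earliest, shortest window, which is an arbitrary choice.

```r
# Hypothetical helper: closest approximate hit of `motif` in string `x`,
# searching from position `from`. Returns a one-row data frame
# (start, len, end, dist) or NULL if no window is within `max_dist`.
best_hit <- function(x, motif, max_dist = 1L, from = 1L) {
  n <- nchar(x)
  m <- nchar(motif)
  if (from > n) return(NULL)
  # candidate windows: every start position, lengths m - max_dist .. m + max_dist
  lens <- seq(max(1L, m - max_dist), m + max_dist)
  win <- expand.grid(start = seq(from, n), len = lens)
  win$end <- win$start + win$len - 1L
  win <- win[win$end <= n, ]
  if (nrow(win) == 0L) return(NULL)
  # full Levenshtein distance between the motif and each window
  win$dist <- as.vector(utils::adist(motif, substring(x, win$start, win$end)))
  win <- win[win$dist <= max_dist, ]
  if (nrow(win) == 0L) return(NULL)
  # prefer the smallest distance, then the earliest start, then the shortest window
  win[order(win$dist, win$start, win$len)[1L], ]
}

# Hypothetical wrapper: text between the best left hit and the best right hit
# that begins after it; NA when either flank is not found.
extract_between_fuzzy <- function(x, left, right, max_dist = 1L) {
  vapply(x, function(s) {
    l <- best_hit(s, left, max_dist)
    if (is.null(l)) return(NA_character_)
    r <- best_hit(s, right, max_dist, from = l$end + 1L)
    if (is.null(r)) return(NA_character_)
    substr(s, l$end + 1L, r$start - 1L)
  }, character(1), USE.NAMES = FALSE)
}

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
         "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
res <- extract_between_fuzzy(ref, "CTACG", "CAGTC")
# res[1] is "GTTATGTACGATTAAAGAAGATCGT"
# res[2] and res[3] are both "CGTTGATATTTTGCATGCTTACTCC"
```

Ranking by distance first matters here: the first sequence contains CGTC just before the real right flank, which is one deletion away from CAGTC, and the exact hit wins only because it has distance 0.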
UPDATE: Could I create a dictionary of all motif variants within a certain Levenshtein distance and then match it against each sequence?
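The dictionary idea could look roughly like this for distance 1, assuming the alphabet is A, C, G, T. `variants1()` is a hypothetical helper that enumerates the motif itself plus every single deletion, substitution, and insertion; larger distances would need the step applied repeatedly, and the dictionary grows quickly.

```r
# Hypothetical helper: all strings within Levenshtein distance 1 of `motif`.
variants1 <- function(motif, alphabet = c("A", "C", "G", "T")) {
  ch <- strsplit(motif, "")[[1]]
  n <- length(ch)
  # drop one character at each position
  dels <- vapply(seq_len(n), function(i) paste(ch[-i], collapse = ""),
                 character(1))
  # replace the character at each position with every letter
  subs <- unlist(lapply(seq_len(n), function(i) {
    vapply(alphabet, function(a) paste(replace(ch, i, a), collapse = ""),
           character(1))
  }))
  # insert every letter at each of the n + 1 gaps
  ins <- unlist(lapply(0:n, function(i) {
    vapply(alphabet, function(a) paste(append(ch, a, after = i), collapse = ""),
           character(1))
  }))
  # unique() removes repeats (e.g. substituting a letter with itself)
  unique(c(motif, dels, subs, ins))
}

left_dict <- variants1("CTACG")
"CTCG" %in% left_dict  # TRUE: the shortened flank of the third sequence is covered
```

A dictionary hit alone does not say which of several fuzzy hits is the real flank (for example, CGTC inside the first sequence is one deletion away from CAGTC), so exact hits should probably be preferred over variants when both occur.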