Question

Extract strings based on multiple patterns in R

0

Entering edit mode

2.4 years ago

LDT ▴ 340

I have thousands of DNA sequences that look like this :).

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

I need to extract every sequence between the CTACG and CAGTC. However, many cases in these sequences come with an error (deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#> 
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#> 
#> [[3]]
#> [1] NA

reprex()
#> Error in reprex(): could not find function "reprex"

^{Created on 2021-12-18 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}

Like this I would be able to extract the sequence also in the last case.

UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

DNA strings Biostrings r Levenhstein • 1.4k views

ADD COMMENT • link updated 2.4 years ago by cpad0112 21k • written 2.4 years ago by LDT ▴ 340

1

Entering edit mode

2.4 years ago

cpad0112 21k

You can try this:

> ex_between(ref, "CT[AC][GC][GC]", "CAGTC",fixed = F)
[[1]]
[1] "GTTATGTACGATTAAAGAAGATCGT"

[[2]]
[1] "CGTTGATATTTTGCATGCTTACTCC"

[[3]]
[1] "GTTGATATTTTGCATGCTTACTCC"

In OP example case, following also works:

>ex_between(ref, "CT[AC][GC]{2}", "CAGTC",fixed = F)

If you are using R just for this step, I would suggest, cutadapt, seqkit (amplicon) to use.

ADD COMMENT • link 2.4 years ago by cpad0112 21k

score 3 · Accepted Answer · 2021-12-18

Here is my suggestion, though I am not sure what to do if multiple matches of a pattern occur. Here I simple took the left/rightmost matches. Not super elegant due to the naive lapply-ing but the function vmatchPattern() which can handle a DNAStringSet (so multiple sequences at once) does not support indels yet. With n_mismatches you can define the allowed number of mismatches. This takes about 10sec for 3000 sequences of that length on my machine, so I guess that's ok.


library(Biostrings)

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", 
         "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC",
         "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

#/ define number of allowed mismatches
n_mismatches <- 1
lapply(1:length(ref), function(x){

  #/ allow mismatches and indels:
  dset <- DNAString(ref[x])

  #/ match 1st pattern allowing indels/mismatches
  tmp_match1 <- 
    matchPattern(pattern = "CTACG", subject = dset, 
                 max.mismatch = n_mismatches, with.indels = TRUE)

  #/ match 2st pattern allowing indels/mismatches
  tmp_match2 <- 
    matchPattern(pattern = "CAGTC", subject = dset, 
                 max.mismatch = n_mismatches, with.indels = TRUE)

  #/ leftmost coordinate is the smallest end of pattern 1 plus 1
  if(length(tmp_match1)>0){
    r1 <- min(end(tmp_match1))+1
  } else r1 <- NA

  #/ rightmost coordinate is the largest start of pattern 2 minus 1
  if(length(tmp_match2)>0){
    r2 <- max(start(tmp_match2))-1
  } else r2 <- NA

  #/ either return NA if any of the patterns went unmatched, or return the sequence in between
  if(sum(is.na(c(r1, r2))) > 0){
    return(NA)
  } else {
    return(dset[r1:r2])
  }

})

[[1]]
25-letter DNAString object
seq: GTTATGTACGATTAAAGAAGATCGT

[[2]]
25-letter DNAString object
seq: CGTTGATATTTTGCATGCTTACTCC

[[3]]
25-letter DNAString object
seq: CGTTGATATTTTGCATGCTTACTCC

[[4]]
[1] NA