Extract strings based on multiple patterns in R
2
0
Entering edit mode
2.4 years ago
LDT ▴ 340

I have thousands of DNA sequences that look like this :).

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

I need to extract every sequence between the CTACG and CAGTC. However, many cases in these sequences come with an error (deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#> 
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#> 
#> [[3]]
#> [1] NA

reprex()
#> Error in reprex(): could not find function "reprex"

Created on 2021-12-18 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)

Like this I would be able to extract the sequence also in the last case.

UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

DNA strings Biostrings r Levenhstein • 1.4k views
ADD COMMENT
3
Entering edit mode
2.4 years ago
ATpoint 82k

Here is my suggestion, though I am not sure what to do if multiple matches of a pattern occur. Here I simple took the left/rightmost matches. Not super elegant due to the naive lapply-ing but the function vmatchPattern() which can handle a DNAStringSet (so multiple sequences at once) does not support indels yet. With n_mismatches you can define the allowed number of mismatches. This takes about 10sec for 3000 sequences of that length on my machine, so I guess that's ok.


library(Biostrings)

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", 
         "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC",
         "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

#/ define number of allowed mismatches
n_mismatches <- 1
lapply(1:length(ref), function(x){

  #/ allow mismatches and indels:
  dset <- DNAString(ref[x])

  #/ match 1st pattern allowing indels/mismatches
  tmp_match1 <- 
    matchPattern(pattern = "CTACG", subject = dset, 
                 max.mismatch = n_mismatches, with.indels = TRUE)

  #/ match 2st pattern allowing indels/mismatches
  tmp_match2 <- 
    matchPattern(pattern = "CAGTC", subject = dset, 
                 max.mismatch = n_mismatches, with.indels = TRUE)

  #/ leftmost coordinate is the smallest end of pattern 1 plus 1
  if(length(tmp_match1)>0){
    r1 <- min(end(tmp_match1))+1
  } else r1 <- NA

  #/ rightmost coordinate is the largest start of pattern 2 minus 1
  if(length(tmp_match2)>0){
    r2 <- max(start(tmp_match2))-1
  } else r2 <- NA

  #/ either return NA if any of the patterns went unmatched, or return the sequence in between
  if(sum(is.na(c(r1, r2))) > 0){
    return(NA)
  } else {
    return(dset[r1:r2])
  }

})

[[1]]
25-letter DNAString object
seq: GTTATGTACGATTAAAGAAGATCGT

[[2]]
25-letter DNAString object
seq: CGTTGATATTTTGCATGCTTACTCC

[[3]]
25-letter DNAString object
seq: CGTTGATATTTTGCATGCTTACTCC

[[4]]
[1] NA
ADD COMMENT
1
Entering edit mode
2.4 years ago

You can try this:

> ex_between(ref, "CT[AC][GC][GC]", "CAGTC",fixed = F)
[[1]]
[1] "GTTATGTACGATTAAAGAAGATCGT"

[[2]]
[1] "CGTTGATATTTTGCATGCTTACTCC"

[[3]]
[1] "GTTGATATTTTGCATGCTTACTCC"

In OP example case, following also works:

>ex_between(ref, "CT[AC][GC]{2}", "CAGTC",fixed = F)

If you are using R just for this step, I would suggest, cutadapt, seqkit (amplicon) to use.

ADD COMMENT

Login before adding your answer.

Traffic: 2069 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6