Question: In R, how do I apply regexes to specific parts of a string that contains a pattern?

soosus • **10** wrote:

I have a dataframe (**trip**) that contains a column (**SNP**). It looks like this (but longer, and it has 192 levels):

    SNP
    C[T->C]T
    C[G->C]A
    G[A->C]A
    C[T->C]C
    C[C->A]G
    T[G->A]C
    ...

I want to pattern match and replace on the following criteria:

    gsub("G->T", "C->A", trip)
    gsub("G->C", "C->G", trip)
    gsub("G->A", "C->T", trip)
    gsub("A->T", "T->A", trip)
    gsub("A->G", "T->C", trip)
    gsub("A->C", "T->G", trip)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

    if ((grep(G->T|G->C|G->A|A->T|A->G|A->C), trip$SNP)==TRUE){
        substr(trip$SNP, 1,1) <- tr /ATCG/TAGC/; #incompatible perl syntax?
        substr(trip$SNP, 8,8) <- tr /ATCG/TAGC/;
    }

That is, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters of that string according to this Perl transliteration: tr /ATCG/TAGC/;
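For reference, base R's `chartr()` does the same character-for-character mapping as Perl's `tr`, so the flank step described above could be sketched roughly as follows. This is only a sketch: the vector `snp` just copies the six example values from the question, and `hit` and `flanked` are hypothetical helper names. It assumes every string has the fixed 8-character layout, and it only handles the two flanking bases, not the substitution inside the brackets.

```r
# Example values copied from the question; in the real data this would be
# as.character(trip$SNP), since the column is a factor.
snp <- c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C")

# TRUE for strings that contain one of the six substitutions of interest
hit <- grepl("G->T|G->C|G->A|A->T|A->G|A->C", snp)

# Work on the matching strings only, then write them back
flanked <- snp[hit]
# chartr() maps A->T, T->A, C->G, G->C, like Perl's tr/ATCG/TAGC/
substr(flanked, 1, 1) <- chartr("ATCG", "TAGC", substr(flanked, 1, 1))
substr(flanked, 8, 8) <- chartr("ATCG", "TAGC", substr(flanked, 8, 8))
snp[hit] <- flanked
```

Note that the match has to be computed before any `gsub()` call rewrites the substitution, otherwise the information about which strings need their flanks changed is lost.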

Desired output, with changes in bold:

SNP C[T->C]T C[G->C]A G[A->C]A C[T->C]C C[C->A]G T[G->A]C

to:

SNP C[T->C]T **G[C->G]T C[T->G]T** C[T->C]C C[C->A]G **A[C->T]G**

Is there a more elegant way to do this?

modified 5.9 years ago by Michael Dondrup ♦ **47k** • written 5.9 years ago by soosus • **10**

So, you want to complement (but not reverse) the sequence in SNP if and only if the substitution is G->something or A->something, right? Then there is a much easier solution, in a single line of code. Is the format for SNP fixed (the substitution, plus one base before and one after)?

written 5.9 years ago by Michael Dondrup ♦ 47k

Yes, it's fixed. It's simply the SNP, expressed as the reference->variant, flanked by its neighbors. And yes, complement but not reverse.
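With the format confirmed as fixed, one way the "complement but not reverse" idea could look in R is sketched below. This is an illustration under those assumptions, not necessarily the code posted in the thread; `snp`, `needs_complement`, and `converted` are hypothetical names, and the input is again the six example values from the question.

```r
# Example values copied from the question; for the real column use
# as.character(trip$SNP), because SNP is a factor.
snp <- c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C")

# TRUE where the reference base (the base right after "[") is G or A
needs_complement <- grepl("\\[[GA]->", snp)

# chartr() complements every base in the string and leaves "[", "->", "]"
# untouched, so flanks and substitution are converted in one step
converted <- ifelse(needs_complement, chartr("ATCG", "TAGC", snp), snp)
```

Because `chartr()` only touches the letters A, T, C and G, the whole string can be complemented at once, which makes the separate `gsub()` calls and the position-based `substr()` replacements unnecessary. For the example values this gives the desired output listed in the question.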

soosus • 10

Then check out the code below...

Michael Dondrup ♦ 47k