Question: In R, how do I apply regexes to specific parts of a string that contains a pattern?

soosus

I have a dataframe (**trip**) that contains a column (**SNP**). It looks like this (but longer, and it has 192 levels):

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C
...

I want to pattern match and replace on the following criteria:

    gsub("G->T", "C->A", trip)
    gsub("G->C", "C->G", trip)
    gsub("G->A", "C->T", trip)
    gsub("A->T", "T->A", trip)
    gsub("A->G", "T->C", trip)
    gsub("A->C", "T->G", trip)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

    if (grep("G->T|G->C|G->A|A->T|A->G|A->C", trip$SNP) == TRUE) {
        substr(trip$SNP, 1, 1) <- tr /ATCG/TAGC/;  # incompatible Perl syntax?
        substr(trip$SNP, 8, 8) <- tr /ATCG/TAGC/;
    }

As in, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters in that string according to this Perl transliteration: tr /ATCG/TAGC/;
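For reference, base R's `chartr()` does character-by-character translation much like Perl's `tr`. The following is only a minimal sketch of the rule described above, assuming every SNP string has the fixed 8-character layout shown in the question; the helper `complement_at()` and the rebuilt `trip` data frame are hypothetical and exist only for illustration.

```r
# Example data copied from the question; `trip` is rebuilt here for illustration.
trip <- data.frame(
  SNP = c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C"),
  stringsAsFactors = FALSE
)

snp <- as.character(trip$SNP)  # as.character() in case SNP is stored as a factor

# TRUE for strings that contain one of the six patterns (reference base G or A)
hit <- grepl("G->T|G->C|G->A|A->T|A->G|A->C", snp)

# Hypothetical helper: complement the characters from `start` to `stop`.
# chartr() maps A->T, T->A, C->G, G->C and leaves all other characters alone.
complement_at <- function(x, start, stop) {
  substr(x, start, stop) <- chartr("ATCG", "TAGC", substr(x, start, stop))
  x
}

snp[hit] <- complement_at(snp[hit], 3, 6)  # the substitution, e.g. G->C becomes C->G
snp[hit] <- complement_at(snp[hit], 1, 1)  # 1st character (left flank)
snp[hit] <- complement_at(snp[hit], 8, 8)  # 8th character (right flank)

trip$SNP <- snp
print(trip)
```

Computing `hit` once, before any replacement, keeps the test independent of the edits that follow it.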

Desired output, with changes in bold:

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C

to:

SNP
C[T->C]T
**G[C->G]T**
**C[T->G]T**
C[T->C]C
C[C->A]G
**A[C->T]G**

Is there a more elegant way to do this?

5.0 years ago by Michael Dondrup

written 5.0 years ago by soosus
So, you want to complement (but not reverse) the sequence in SNP if and only if the substitution is G->something or A->something, right? Then there is a much easier solution, in a single line of code. Is the format for SNP fixed (the substitution plus one base before and one base after)?

Yes, it's fixed. It's simply the SNP, expressed as the reference->variant, flanked by its neighbors. And yes, complement but not reverse.
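Given that confirmation, one possible sketch of a single-line approach is to complement the whole string whenever the reference base is G or A, since `chartr()` leaves the brackets and the arrow untouched. This is an illustration under the fixed-format assumption, not necessarily the code the answerer posted; the vector names `snp_in` and `snp_out` are made up for the example.

```r
# Example strings copied from the question
snp_in <- c("C[T->C]T", "C[G->C]A", "G[A->C]A", "C[T->C]C", "C[C->A]G", "T[G->A]C")

# Complement every base in strings whose reference base (right after "[") is G or A;
# all other strings are returned unchanged.
snp_out <- ifelse(grepl("\\[[GA]->", snp_in), chartr("ACGT", "TGCA", snp_in), snp_in)

print(snp_out)
```

For a data frame column this could be applied as `trip$SNP <- ifelse(...)` on `as.character(trip$SNP)`, converting first in case the column is a factor.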

Then check out the code below...

