Question

In R, how do I apply regexes to specific parts of a string that contains a pattern?

10.0 years ago

soosus ▴ 10

I have a data frame (trip) that contains a column (SNP). It looks like this, but longer; the column has 192 levels:

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C
...

I want to pattern match and replace on the following criteria:

trip$SNP <- gsub("G->T", "C->A", trip$SNP)
trip$SNP <- gsub("G->C", "C->G", trip$SNP)
trip$SNP <- gsub("G->A", "C->T", trip$SNP)
trip$SNP <- gsub("A->T", "T->A", trip$SNP)
trip$SNP <- gsub("A->G", "T->C", trip$SNP)
trip$SNP <- gsub("A->C", "T->G", trip$SNP)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

if ((grep(G->T|G->C|G->A|A->T|A->G|A->C), trip$SNP)==TRUE){
   substr(trip$SNP, 1,1) <- tr /ATCG/TAGC/; #incompatible perl syntax?
   substr(trip$SNP, 8,8) <- tr /ATCG/TAGC/;
   }

That is, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters of that string according to this Perl-style transliteration: tr /ATCG/TAGC/;

Desired output, with changed rows marked. From:

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C

to:

SNP
C[T->C]T
G[C->G]T #<-- changed
C[T->G]T #<-- changed
C[T->C]C
C[C->A]G
A[C->T]G #<-- changed

Is there a more elegant way to do this?

regex R • 2.6k views

updated 2.7 years ago by Ram 43k • written 10.0 years ago by soosus ▴ 10

So, you want to complement (but not reverse) the sequence in SNP if and only if the substitution is G->something or A->something, right? In that case there is a much simpler solution, in a single line of code. Is the format for SNP fixed (the substituted base plus one base before and one after)?

updated 4.4 years ago by Ram 43k • written 10.0 years ago by Michael 54k

Yes, it's fixed. It's simply the SNP, expressed as the reference->variant, flanked by its neighbors. And yes, complement but not reverse.

written 10.0 years ago by soosus ▴ 10

Then check out code below...

written 10.0 years ago by Michael 54k

Ram · Accepted Answer · 2014-04-28

10.0 years ago

Michael 54k

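## assuming df is your data frame
## convert to character in case SNP is stored as a factor
SNP <- as.character(df$SNP)

SNP
## [1] "C[T->C]T" "C[G->C]A" "G[A->C]A" "C[T->C]C" "C[C->A]G" "T[G->A]C"

## indices of the entries whose reference base is A or G
i <- grep("(A|G)->", SNP)
## complement every base in those entries: both flanks, reference and variant
SNP[i] <- chartr("ACGT", "TGCA", SNP[i])

SNP
## [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"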
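
To get the result back into the data frame, something like the sketch below might work. It rebuilds the six example rows under the asker's name `trip` (the data here are only a stand-in), assumes SNP is a factor because the question mentions levels, and adds a format check that is not in the code above. The names `snp` and `purine_ref` are made up for illustration.

```r
# Stand-in for the asker's data frame; only the six rows shown in the question.
trip <- data.frame(
  SNP = factor(c("C[T->C]T", "C[G->C]A", "G[A->C]A",
                 "C[T->C]C", "C[C->A]G", "T[G->A]C"))
)

# Work on a character copy so chartr() results can be assigned freely.
snp <- as.character(trip$SNP)

# Guard (an added assumption): every entry is flank[ref->alt]flank over A/C/G/T.
stopifnot(all(grepl("^[ACGT]\\[[ACGT]->[ACGT]\\][ACGT]$", snp)))

# TRUE where the reference base is A or G.
purine_ref <- grepl("[AG]->", snp)

# Complement (without reversing) only those entries.
snp[purine_ref] <- chartr("ACGT", "TGCA", snp[purine_ref])

# Rebuild the factor so that unused old levels are dropped.
trip$SNP <- factor(snp)
```

`grepl()` returns a logical vector instead of the indices from `grep()`; either works for subsetting. If all 192 original levels are present, `nlevels(trip$SNP)` should be 96 afterwards, since each A/G-reference entry folds onto its C/T-reference complement.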

updated 4.4 years ago by Ram 43k • written 10.0 years ago by Michael 54k