In R, how do I apply regexes to specific parts of a string that contains a pattern?
7.0 years ago
soosus ▴ 10

I have a dataframe (**trip**) that contains a column (**SNP**). It looks like this (but longer, and it has 192 levels):

    SNP
    C[T->C]T
    C[G->C]A
    G[A->C]A
    C[T->C]C
    C[C->A]G
    T[G->A]C
    ...

I want to pattern match and replace on the following criteria:

    gsub("G->T", "C->A", trip)
    gsub("G->C", "C->G", trip)
    gsub("G->A", "C->T", trip)
    gsub("A->T", "T->A", trip)
    gsub("A->G", "T->C", trip)
    gsub("A->C", "T->G", trip)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

    if ((grep(G->T|G->C|G->C|A->T|A->G|A->C), trip$SNP)==TRUE){
       substr(trip$SNP, 1,1) <- tr /ATCG/TAGC/; #incompatible perl syntax?
       substr(trip$SNP, 8,8) <- tr /ATCG/TAGC/;
       }

That is, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters of that string according to this Perl transliteration: tr /ATCG/TAGC/;

Desired output, with changes in bold:

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C

to:

SNP
C[T->C]T
**G[C->G]T**
**C[T->G]T**
C[T->C]C
C[C->A]G
**A[C->T]G**

Is there a more elegant way to do this?

 

R regex • 2.0k views

So, you want to complement (but not reverse) the sequence in SNP if and only if the substitution is G-> something or A-> something, right? If so, there is a much simpler solution, in a single line of code. Is the format for SNP fixed (the substitution plus one base before and one base after)?


Yes, it's fixed. It's simply the SNP, expressed as the reference->variant, flanked by its neighbors. And yes, complement but not reverse.
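Since the solution relies on the format being fixed, it may be worth checking that assumption before transforming anything. The snippet below is only a sketch: the vector `snp` and the regular expression `pattern` are illustrative names, not part of the original data, and the pattern assumes exactly one flanking base on each side and uppercase A/C/G/T only.

```r
# Illustrative input; in practice this would be as.character(trip$SNP)
snp <- c("C[T->C]T", "C[G->C]A", "T[G->A]C")

# One base, "[ref->alt]", one base; brackets are escaped so they match literally
pattern <- "^[ACGT]\\[[ACGT]->[ACGT]\\][ACGT]$"

ok <- grepl(pattern, snp)   # TRUE for every entry that fits the fixed format
snp[!ok]                    # entries that would need a closer look, if any
```

Any entry listed by `snp[!ok]` does not fit the assumed 8-character layout and would be worth inspecting by hand.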


Then check out the code below...

7.0 years ago
    ## assuming df is your data frame
    SNP <- as.character(df$SNP)

    SNP
    # [1] "C[T->C]T" "C[G->C]A" "G[A->C]A" "C[T->C]C" "C[C->A]G" "T[G->A]C"

    ## indices of the entries whose reference base is A or G
    i <- grep("(A|G)->", SNP)
    ## complement (without reversing) only those entries
    SNP[i] <- chartr("ACGT", "TGCA", SNP[i])

    SNP
    # [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
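To write the result back into the data frame from the question, the same idea can be wrapped in a small function. This is a sketch only: the function name `to_pyrimidine_ref` is made up for illustration, and the example `trip` data frame is rebuilt here from the six rows shown in the question, with `SNP` stored as a factor since the question mentions levels.

```r
# Hypothetical helper: complement (without reversing) every entry whose
# reference base is A or G, leaving C and T references untouched.
to_pyrimidine_ref <- function(snp) {
  snp <- as.character(snp)                  # convert a factor to plain strings
  i <- grepl("[AG]->", snp)                 # TRUE where the reference is A or G
  snp[i] <- chartr("ACGT", "TGCA", snp[i])  # base-by-base complement
  snp
}

# Example data rebuilt from the rows shown in the question
trip <- data.frame(SNP = c("C[T->C]T", "C[G->C]A", "G[A->C]A",
                           "C[T->C]C", "C[C->A]G", "T[G->A]C"),
                   stringsAsFactors = TRUE)

trip$SNP <- to_pyrimidine_ref(trip$SNP)
trip$SNP
# [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
```

Because `chartr` leaves characters outside its translation set alone, the brackets and the `->` arrow pass through unchanged, so a single call covers both flanking bases and the substitution itself. The column comes back as a character vector; wrap it in `factor()` if a factor is needed.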