Question: In R, how do I apply regexes to specific parts of a string that contains a pattern?
5.0 years ago by
soosus10
United States
soosus10 wrote:

I have a dataframe (**trip**) that contains a column (**SNP**). It looks like this (but longer, and it has 192 levels):

    SNP
    C[T->C]T
    C[G->C]A
    G[A->C]A
    C[T->C]C
    C[C->A]G
    T[G->A]C
    ...

I want to pattern match and replace on the following criteria:

    gsub("G->T", "C->A", trip$SNP)
    gsub("G->C", "C->G", trip$SNP)
    gsub("G->A", "C->T", trip$SNP)
    gsub("A->T", "T->A", trip$SNP)
    gsub("A->G", "T->C", trip$SNP)
    gsub("A->C", "T->G", trip$SNP)

but ALSO, if one of the patterns listed above is found, I want the string that contains it to have additional substitutions applied. Namely:

    if ((grep(G->T|G->C|G->A|A->T|A->G|A->C), trip$SNP)==TRUE){
       substr(trip$SNP, 1,1) <- tr /ATCG/TAGC/; #incompatible perl syntax?
       substr(trip$SNP, 8,8) <- tr /ATCG/TAGC/;
       }

That is, if any of these patterns (G->T, G->C, G->A, A->T, A->G, or A->C) is found in a string in trip$SNP, replace the 1st and 8th characters in that string according to this Perl transliteration: tr /ATCG/TAGC/;
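For reference, base R's `chartr()` does the job of Perl's `tr///`, and `substr<-` can overwrite a single position. The step-by-step procedure described above could be sketched roughly as follows; the names `snp`, `rules`, `hit`, and `sel` are made up for illustration, and the input is just the six example rows.

```r
# Example rows from the question (the real column has many more values)
snp <- c("C[T->C]T", "C[G->C]A", "G[A->C]A",
         "C[T->C]C", "C[C->A]G", "T[G->A]C")

# The six substitution rules, written as a named lookup (illustrative name)
rules <- c("G->T" = "C->A", "G->C" = "C->G", "G->A" = "C->T",
           "A->T" = "T->A", "A->G" = "T->C", "A->C" = "T->G")

# Find the affected strings BEFORE rewriting the substitution itself
hit <- grepl("(G|A)->", snp)
sel <- snp[hit]

# Complement the flanking bases at positions 1 and 8 (R analogue of tr/ATCG/TAGC/)
substr(sel, 1, 1) <- chartr("ATCG", "TAGC", substr(sel, 1, 1))
substr(sel, 8, 8) <- chartr("ATCG", "TAGC", substr(sel, 8, 8))

# Apply the substitution rules literally (fixed = TRUE, so no regex interpretation)
for (from in names(rules)) {
  sel <- sub(from, rules[[from]], sel, fixed = TRUE)
}

snp[hit] <- sel
print(snp)
```

This only works because every string has the fixed 8-character layout shown in the question; with a different layout the positions 1 and 8 would need adjusting.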

Desired output, with changes in bold:

SNP
C[T->C]T
C[G->C]A
G[A->C]A
C[T->C]C
C[C->A]G
T[G->A]C

to:

SNP
C[T->C]T
**G[C->G]T
C[T->G]T**
C[T->C]C
C[C->A]G
**A[C->T]G**

Is there a more elegant way to do this?


So, you want to complement (but not reverse) the sequence in SNP if and only if the substitution is G->something or A->something, right? Then there is a much easier solution, in a single line of code. Is the format for SNP fixed (the substitution plus one base before and after)?

written 5.0 years ago by Michael Dondrup

Yes, it's fixed. It's simply the SNP, expressed as the reference->variant, flanked by its neighbors. And yes, complement but not reverse.

written 5.0 years ago by soosus

Then check out the code below...

written 5.0 years ago by Michael Dondrup
5.0 years ago by
Bergen, Norway
Michael Dondrup46k wrote:
    ## assuming df is your data frame
    SNP <- as.character(df$SNP)
    SNP
    # [1] "C[T->C]T" "C[G->C]A" "G[A->C]A" "C[T->C]C" "C[C->A]G" "T[G->A]C"
    ## indices of the entries whose reference base is A or G
    i <- grep("(A|G)->", SNP)
    ## complement (without reversing) every base in those entries
    SNP[i] <- chartr("ACGT", "TGCA", SNP[i])
    SNP
    # [1] "C[T->C]T" "G[C->G]T" "C[T->G]T" "C[T->C]C" "C[C->A]G" "A[C->T]G"
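If the result needs to go back into the original data frame, the same idea can be wrapped in a small function. This is only a sketch: `complement_purine_snps` is a made-up name, and the `trip` data frame built here is a stand-in containing just the six example rows, with SNP stored as a factor as in the question.

```r
# Hypothetical helper: complement every base in SNP strings whose
# reference base (the one before "->") is A or G; leave the rest alone.
complement_purine_snps <- function(snp) {
  snp <- as.character(snp)              # factors must be converted first
  hit <- grepl("[AG]->", snp)           # TRUE where the reference is A or G
  snp[hit] <- chartr("ACGT", "TGCA", snp[hit])
  snp
}

# Stand-in for the asker's data frame (assumed: SNP is a factor column)
trip <- data.frame(
  SNP = c("C[T->C]T", "C[G->C]A", "G[A->C]A",
          "C[T->C]C", "C[C->A]G", "T[G->A]C"),
  stringsAsFactors = TRUE
)

# Rebuild the factor so that levels which no longer occur are dropped
trip$SNP <- factor(complement_purine_snps(trip$SNP))
print(trip)
```

Because `chartr()` leaves the characters `[`, `-`, `>`, and `]` untouched, the whole string can be translated in one call. On the full 192-level column, the rebuilt factor should end up with at most 96 levels, since every A or G reference is folded onto its T or C counterpart.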
