A regex to convert operon names to genes?
8 weeks ago
rioualen ▴ 580

Hi,

I would like to convert operon names to gene names (and the reverse). I think this should be possible with a regex, but I'm not fluent enough with regexes to crack it.

Conventionally, operons are named like this:

genes           operon_name  strand
oneA,oneB,oneC  oneABC       +
oneA,oneB,oneC  oneCBA       -
oneA,oneB,twoD  oneAB-twoD   +

Occasionally operons can also come out as "oneA-oneB-oneC" or "someID-someotherID".

Any tip on how to get this to work, preferably in R? It doesn't have to work in all cases, but it'd help a lot if it allowed me to reduce the amount of manual intervention.

Thanks a lot.

regex r gene operon • 169 views
8 weeks ago
Dunois ★ 1.5k

This isn't a regex problem at all, but here's a solution in R.

library(stringr)
library(magrittr)
library(tidyr)
library(dplyr)


#Toy data.
df <- data.frame(genes = c("oneA,oneB,oneC", "oneA-oneB-oneC", "oneA,oneB,twoD", "someID1-someID2-someotherID", 
                           "someID-someotherID"), 
                 strand = c("+", "-", "+", "-", "+"))



#Assigning a grouping identifier to each set of genes that constitute an operon.
#Also separating the genes into their own respective rows.
df %<>%
  mutate(grp = row_number()) %>%
  separate_rows(genes, sep = "[,\\-]")

#Extracting the operon component (e.g., "one") and gene component (e.g., "A") 
#identifiers into separate columns.
df %<>%
  mutate(op1 = str_extract(genes, "^[a-z]+"),
         op2 = str_extract(genes, "[A-Z0-9]+$"))

#Grouping by grp and collapsing the genes together for later use.
df %<>%
  group_by(grp) %>%
  mutate(genes = paste0(genes, collapse = ",")) %>%
  ungroup()

#Grouping by the operon grouping and operon component to collapse the gene components
#into a single row each.
#Prior to collapsing, orienting the gene components correctly based on strand
#orientation.
#Then retaining only unique operon components (since the gene components are
#now duplicated across rows.)
df %<>%
  group_by(grp, op1) %>%
  mutate(op2 = ifelse(strand == "+", op2, sort(op2, decreasing = TRUE))) %>%
  mutate(op2 = paste0(op2, collapse = "")) %>%
  distinct(op1, .keep_all = TRUE) %>%
  ungroup()

#Putting the operon and gene components back together and collapsing
#the operon components by grp, and removing duplicates + columns.
df %<>%
  mutate(op = paste0(op1, op2)) %>%
  group_by(grp) %>%
  mutate(op = paste0(op, collapse = "-")) %>%
  distinct(op, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-c(grp, op1, op2))

#Final result.
df

# # A tibble: 5 × 3
#   genes                       strand op                    
#   <chr>                       <chr>  <chr>                 
# 1 oneA,oneB,oneC              +      oneABC                
# 2 oneA,oneB,oneC              -      oneCBA                
# 3 oneA,oneB,twoD              +      oneAB-twoD            
# 4 someID1,someID2,someotherID -      someID2ID1-someotherID
# 5 someID,someotherID          +      someID-someotherID

Look at the comments in the code for explanations. I consider the solution incomplete.
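One visible limitation is how the two patterns in the extraction step split names that don't follow the "oneA" shape. The sketch below applies the same two regexes in base R to a few of the toy identifiers, so the split can be inspected on its own. The vector names (`genes`, `prefix`, `suffix`) are just for illustration.

```r
# Toy gene names taken from the example data above.
genes <- c("oneA", "twoD", "someID1", "someotherID")

# Leading run of lowercase letters: treated as the operon component.
prefix <- regmatches(genes, regexpr("^[a-z]+", genes))

# Trailing run of uppercase letters or digits: treated as the gene component.
suffix <- regmatches(genes, regexpr("[A-Z0-9]+$", genes))

# Caveat: regmatches() silently drops elements with no match, so this
# side-by-side view is only safe when every name matches both patterns.
data.frame(genes, prefix, suffix)
```

"someID1" is split into "some" and "ID1", which is why row 4 of the result ends up as "someID2ID1" rather than something more sensible.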

As for this part of the question:

"Occasionally operons can also come out as "oneA-oneB-oneC" or "someID-someotherID"."

That doesn't help, because you haven't given us any idea of how those cases are supposed to be treated.
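For the reverse direction (operon name to genes), which the question also asks for, here is a rough base R sketch and not a complete solution. `operon_to_genes` is a hypothetical helper. It assumes that an expandable name is exactly three lowercase letters followed by one or more capitals, each capital being one gene, and that minus-strand names list genes in reverse order. Anything that doesn't fit that shape (such as "someID") is passed through untouched.

```r
# Hypothetical helper: expand one operon name into a comma-separated gene list.
# Assumption: expandable parts look like three lowercase letters followed by
# capitals (e.g. "oneABC"); any other part is returned unchanged.
operon_to_genes <- function(operon_name, strand = "+") {
  parts <- strsplit(operon_name, "-", fixed = TRUE)[[1]]
  genes <- unlist(lapply(parts, function(part) {
    if (!grepl("^[a-z]{3}[A-Z]+$", part)) {
      return(part)  # unconventional ID, e.g. "someID": keep as is
    }
    prefix <- substr(part, 1, 3)
    suffix_letters <- strsplit(substring(part, 4), "")[[1]]
    paste0(prefix, suffix_letters)
  }))
  # Assumption: minus-strand operon names list genes in reverse order,
  # so flip them to match the order used in the genes column.
  if (strand == "-") genes <- rev(genes)
  paste(genes, collapse = ",")
}

operon_to_genes("oneABC", "+")              # "oneA,oneB,oneC"
operon_to_genes("oneCBA", "-")              # "oneA,oneB,oneC"
operon_to_genes("oneAB-twoD", "+")          # "oneA,oneB,twoD"
operon_to_genes("someID-someotherID", "+")  # "someID,someotherID"
```

To run it over a whole table, something like `mapply(operon_to_genes, df$operon_name, df$strand)` should work, assuming those column names.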
