Question

A regex to convert operon names to genes?

3.9 years ago

rioualen ▴ 750

Hi,

I would like to convert operon names to gene names (and the reverse). I think this should be possible with a regex, but I'm not fluent enough with regexes to work it out.

Conventionally, operons are named like this:

genes           operon_name  strand
oneA,oneB,oneC  oneABC       +
oneA,oneB,oneC  oneCBA       -
oneA,oneB,twoD  oneAB-twoD   +

Occasionally operons can also come out as "oneA-oneB-oneC" or "someID-someotherID".

Any tips on how to get this to work, preferably in R? It doesn't have to work in all cases, but it would help a lot if it reduced the amount of manual intervention.

Thanks a lot.

regex r gene operon

updated 3.9 years ago by Dunois ★ 2.9k • written 3.9 years ago by rioualen ▴ 750

score 1 · Accepted Answer · 2021-08-26

This isn't a regex problem at all, but here's a solution in R.

library(stringr)
library(magrittr)
library(tidyr)
library(dplyr)


#Toy data.
df <- data.frame(genes = c("oneA,oneB,oneC", "oneA-oneB-oneC", "oneA,oneB,twoD", "someID1-someID2-someotherID", 
                           "someID-someotherID"), 
                 strand = c("+", "-", "+", "-", "+"))



#Assigning a grouping identifier to each set of genes that constitute an operon.
#Also separating the genes into their own respective rows.
df %<>%
  mutate(grp = row_number()) %>%
  separate_rows(genes, sep = "[,\\-]")

#Extracting the operon component (e.g., "one") and gene component (e.g., "A") 
#identifiers into separate columns.
df %<>%
  mutate(op1 = str_extract(genes, "^[a-z]+"),
         op2 = str_extract(genes, "[A-Z0-9]+$"))

#Grouping by grp and collapsing the genes together for later use.
df %<>%
  group_by(grp) %>%
  mutate(genes = paste0(genes, collapse = ",")) %>%
  ungroup()

#Grouping by the operon grouping and operon component to collapse the gene components
#into a single row each.
#Prior to collapsing, orienting the gene components correctly based on strand
#orientation.
#Then retaining only unique operon components (since the gene components are
#now duplicated across rows.)
df %<>%
  group_by(grp, op1) %>%
  mutate(op2 = ifelse(strand == "+", op2, sort(op2, decreasing = TRUE))) %>%
  mutate(op2 = paste0(op2, collapse = "")) %>%
  distinct(op1, .keep_all = TRUE) %>%
  ungroup()

#Putting the operon and gene components back together and collapsing
#the operon components by grp, and removing duplicates + columns.
df %<>%
  mutate(op = paste0(op1, op2)) %>%
  group_by(grp) %>%
  mutate(op = paste0(op, collapse = "-")) %>%
  distinct(op, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-c(grp, op1, op2))

#Final result.
df

# # A tibble: 5 × 3
#   genes                       strand op                    
#   <chr>                       <chr>  <chr>                 
# 1 oneA,oneB,oneC              +      oneABC                
# 2 oneA,oneB,oneC              -      oneCBA                
# 3 oneA,oneB,twoD              +      oneAB-twoD            
# 4 someID1,someID2,someotherID -      someID2ID1-someotherID
# 5 someID,someotherID          +      someID-someotherID
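
One caveat about the minus-strand step: sort(op2, decreasing = TRUE) puts the gene suffixes in reverse alphabetical order. That equals flipping the listed order only when the genes are already listed alphabetically. If the listed order should simply be flipped instead, rev() might be the better fit. A small illustration in base R, using a made-up suffix order that is not part of the toy data:

```r
# Made-up gene suffixes, deliberately not in alphabetical order.
op2 <- c("A", "C", "B")

# Reverse alphabetical order (what the pipeline above does on "-").
sorted_desc <- paste0(sort(op2, decreasing = TRUE), collapse = "")

# Listed order flipped (an alternative, if input order is meaningful).
reversed <- paste0(rev(op2), collapse = "")

print(c(sorted_desc = sorted_desc, reversed = reversed))
```

Which of the two is right depends on how the gene lists were produced, so treat this as something to check against your own data.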

Look at the comments in the code for explanations. I consider the solution incomplete.

Occasionally operons can also come out as "oneA-oneB-oneC" or "someID-someotherID".

That doesn't help, because you haven't told us how those names are supposed to be treated.
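
The reverse direction (operon name to genes) is not covered by the code above, so here is a minimal base R sketch of it. It rests on an assumption that is not stated in the question: compact names use a three-letter lowercase stem followed by one capital letter per gene (e.g. "oneABC"), and anything that doesn't fit that pattern (e.g. "someID") is passed through unchanged. The function name operon_to_genes is hypothetical.

```r
# Hypothetical helper: expand one operon name into a comma-separated gene list.
# Assumption: compact names look like three lowercase letters + capitals
# ("oneABC"); parts that do not match (e.g. "someID") are kept as they are.
operon_to_genes <- function(operon, strand = "+") {
  # "oneAB-twoD" -> "oneAB" "twoD"
  parts <- strsplit(operon, "-", fixed = TRUE)[[1]]

  genes <- unlist(lapply(parts, function(part) {
    m <- regmatches(part, regexec("^([a-z]{3})([A-Z]+)$", part))[[1]]
    if (length(m) == 0) return(part)        # not a compact name: keep as-is
    paste0(m[2], strsplit(m[3], "")[[1]])   # "oneABC" -> "oneA" "oneB" "oneC"
  }))

  # Minus-strand names are written in reverse, so flip them back.
  if (strand == "-") genes <- rev(genes)

  paste(genes, collapse = ",")
}

# Applying it to the names from the question.
operons <- c("oneABC", "oneCBA", "oneAB-twoD", "someID-someotherID")
strands <- c("+", "-", "+", "+")
genes <- mapply(operon_to_genes, operons, strands, USE.NAMES = FALSE)
print(data.frame(operon_name = operons, strand = strands, genes = genes))
```

The three-letter stem is what keeps "someID" from being split into "someI" and "someD". If your identifiers don't follow that convention, the pattern will need adjusting, and some names will stay ambiguous no matter what.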