Populate data frame column X with substring from column Y using R
6.7 years ago
ddowlin ▴ 70

Hi all,

I have a data frame in R with a column of gene identifiers (taken from FASTA headers). These vary and include Ensembl-style (e.g. ENSP000000001) and NCBI-style (e.g. gi|123|ref|XP_000001.1|) identifiers, as well as others.

I want to extract the accession and version numbers from the NCBI identifiers and create a new column as part of my data frame. Non-NCBI identifiers would have an NA in this column.

For example, I would like to change the following data frame:

df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
                           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124')
                  )

To this:

                             gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>

I have tried using regmatches but this is not working the way I want it to.

df1$accession <- regmatches(df1$gene, regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene))

# results in:

                         gene   accession
1                ENSP00000123 XP_001234.1
2 gi|1234567|ref|XP_001234.1| XP_001267.1
3 gi|1234567|ref|XP_001267.1| XP_001234.1
4                ENSP00000124 XP_001267.1

Any help is greatly appreciated. Thanks in advance.

6.7 years ago
ddowlin ▴ 70

Well, I quickly found a solution using stringr and dplyr.

library(stringr)
library(dplyr)

df1 <- df1 %>%
  mutate(accession = str_extract(gene, "XP_[0-9]+\\.*[0-9]*"))

gives:

                              gene         accession
1                     ENSP00000123           <NA>
2      gi|1234567|ref|XP_001234.1|    XP_001234.1
3      gi|1234567|ref|XP_001267.1|    XP_001267.1
4                     ENSP00000124           <NA>
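
For anyone wondering why the regmatches() attempt in the question misaligns: regmatches() returns only the strings that matched, so here it yields a length-2 vector, and the data frame assignment recycles it across the four rows. A minimal base R sketch that keeps the rows aligned is below; the slightly stricter pattern and the variable name m are my own choices for illustration, not something taken from the posts above.

```r
# Example data from the question (strings kept as character, not factor)
df1 <- data.frame(
  gene = c("ENSP00000123", "gi|1234567|ref|XP_001234.1|",
           "gi|1234567|ref|XP_001267.1|", "ENSP00000124"),
  stringsAsFactors = FALSE
)

# regexpr() returns one entry per row, with -1 where nothing matched
m <- regexpr("XP_[0-9]+(\\.[0-9]+)?", df1$gene)

# Start with NA everywhere, then fill in only the rows that matched;
# regmatches() returns exactly one string per matching row, in order
df1$accession <- NA_character_
df1$accession[m != -1] <- regmatches(df1$gene, m)
```

The key point is that the match positions from regexpr() are used to index the target column, so the shorter vector from regmatches() lands on the right rows instead of being recycled.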

Very impressive. It is much better than my solution. I have to read more about the tidyverse.

6.7 years ago
e.rempel ★ 1.1k

This is more of a general R question, so you may get better answers on Stack Overflow. Here is my attempt, assuming the NCBI accession is always in the same field (the 4th in this case) when the identifier is split on the | separator:

position_id <- 4
is_ncbi <- grepl("XP_", df1$gene, fixed = TRUE)  # rows holding an NCBI-style identifier
df1$accession <- NA
df1$accession[is_ncbi] <- limma::strsplit2(df1$gene[is_ncbi], split = "\\|")[, position_id]
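
If you would rather not depend on limma, the same position-based idea can be sketched in base R with strsplit(). This is only an illustration under the same assumption (accession in the 4th pipe-delimited field); the variable name parts is my own. Note that, unlike the version above, it does not check for the XP_ prefix, so any identifier with at least four fields would get a value.

```r
# Example data from the question (strings kept as character, not factor)
df1 <- data.frame(
  gene = c("ENSP00000123", "gi|1234567|ref|XP_001234.1|",
           "gi|1234567|ref|XP_001267.1|", "ENSP00000124"),
  stringsAsFactors = FALSE
)

position_id <- 4

# Split each identifier on the literal "|" character
parts <- strsplit(df1$gene, "|", fixed = TRUE)

# Take the 4th field of each; indexing past the end of a vector gives NA,
# so identifiers without pipes (e.g. Ensembl IDs) end up as NA
df1$accession <- vapply(parts, function(p) p[position_id], character(1))
```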
6.7 years ago
library(stringr)
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|', 'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
# Escape the dot and allow multi-digit versions. The grepl() pre-filter is optional,
# since str_extract() already returns NA for strings that do not match.
df1$accession <- str_extract(ifelse(grepl("xp", df1$gene, ignore.case = TRUE), df1$gene, NA), "XP_[0-9]+\\.[0-9]+")

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>
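# Alternative with tidyr::separate(), splitting the identifier on the | separator
library(tidyr)
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|', 'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
# Name the four fields explicitly; pad short identifiers with NA and drop the empty trailing piece
df1$accession <- separate(df1, gene, into = c("db", "gi_number", "source", "accession"), sep = "\\|", extra = "drop", fill = "right")$accession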

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>