Question: Populate data frame column X with substring from column Y using R
ddowlin wrote, 3.5 years ago:

Hi all,

I have a data frame in R with a column of gene identifiers (taken from fasta headers). These vary and include both Ensembl style (e.g. ENSP000000001) and NCBI style (e.g. gi|123|ref|XP_000001.1|), as well as others.

I want to extract the accession and version numbers from the NCBI identifiers and create a new column as part of my data frame. Non-NCBI identifiers would have an NA in this column.

For example, I would like to change the following data frame:

df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
                           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'))

To this:

                             gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>

I have tried using regmatches but this is not working the way I want it to.

df1$accession <- regmatches(df1$gene, regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene))

# results in:

                         gene   accession
1                ENSP00000123 XP_001234.1
2 gi|1234567|ref|XP_001234.1| XP_001267.1
3 gi|1234567|ref|XP_001267.1| XP_001234.1
4                ENSP00000124 XP_001267.1

Any help is greatly appreciated. Thanks in advance.

ddowlin wrote, 3.5 years ago:

Well, I quickly found a solution using stringr and dplyr here.

library(stringr)
library(dplyr)

df1 <- df1 %>%
  mutate(accession = str_extract(gene, "XP_[0-9]+\\.*[0-9]*"))

gives:

                              gene         accession
1                     ENSP00000123           <NA>
2      gi|1234567|ref|XP_001234.1|    XP_001234.1
3      gi|1234567|ref|XP_001267.1|    XP_001267.1
4                     ENSP00000124           <NA>
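
For anyone who would rather stay in base R: the regmatches attempt in the question misbehaves because regmatches() returns only the elements that matched (two values here), and assigning two values to a four-row column recycles them. A minimal sketch of one way around that is to keep the regexpr() result and fill in only the matching rows. The tighter pattern below assumes every accession carries a version suffix such as ".1"; that is an assumption, not something stated in the question.

```r
df1 <- data.frame(
  gene = c("ENSP00000123", "gi|1234567|ref|XP_001234.1|",
           "gi|1234567|ref|XP_001267.1|", "ENSP00000124"),
  stringsAsFactors = FALSE
)

# regexpr() returns -1 for elements that contain no match
m <- regexpr("XP_[0-9]+\\.[0-9]+", df1$gene)

# Start with NA everywhere, then fill in only the rows that matched;
# regmatches() returns one value per matching element, in order
df1$accession <- NA_character_
df1$accession[m != -1] <- regmatches(df1$gene, m)
```

The row count on the left of the assignment now equals the number of matches on the right, so nothing gets recycled.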

Very impressive. It is much better than my solution. I have to read more about the tidyverse.

(comment by e.rempel, written 3.5 years ago)

e.rempel (Germany, Heidelberg) wrote, 3.5 years ago:

This is more of a general R question, so you may want to post it on Stack Overflow. Here is my attempt, assuming that the NCBI accession is always in the same field (here the 4th) when the identifier is split on the | separator:

position_id <- 4                               # field holding the accession after splitting on "|"
is_ncbi <- grepl("XP_", df1$gene, fixed = TRUE) # rows that contain an NCBI-style accession
df1$accession <- NA
df1$accession[is_ncbi] <- limma::strsplit2(df1$gene[is_ncbi], split = "\\|")[, position_id]
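
If pulling in limma just for the split feels heavy, the same fixed-position idea can be sketched with base R's strsplit(). This is only an illustration under the same assumption as above (the accession is always the 4th |-separated field); the helper logic is not part of the original answer.

```r
df1 <- data.frame(
  gene = c("ENSP00000123", "gi|1234567|ref|XP_001234.1|",
           "gi|1234567|ref|XP_001267.1|", "ENSP00000124"),
  stringsAsFactors = FALSE
)

position_id <- 4

# Split every identifier on a literal "|" (fixed = TRUE avoids regex escaping)
fields <- strsplit(df1$gene, "|", fixed = TRUE)

# Take the requested field where it exists, otherwise NA
df1$accession <- vapply(
  fields,
  function(f) if (length(f) >= position_id) f[position_id] else NA_character_,
  character(1)
)
```

Identifiers with fewer than four fields, such as the Ensembl ones, fall through to NA without a separate grep step.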

cpad0112 (Hyderabad, India) wrote, 3.5 years ago:
> library(stringr)
> df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|','gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
> df1$accession <- str_extract(ifelse(grepl("xp", df1$gene, ignore.case = TRUE), df1$gene, NA), "XP_[0-9]+\\.[0-9]+")

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>
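
Alternatively, with tidyr:

> library(tidyr)
> df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|','gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
> # split on "|" into four fields (the names db, gi, src are arbitrary labels); short rows are padded with NA
> df1$accession <- separate(df1, gene, into = c("db", "gi", "src", "accession"),
+                           sep = "\\|", extra = "drop", fill = "right")$accession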

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>
(reply by cpad0112, written 3.5 years ago, modified 3.5 years ago)