I have a data frame in R with a column of gene identifiers (taken from FASTA headers). These vary and include both Ensembl-style identifiers (e.g. ENSP000000001) and NCBI-style identifiers (e.g. gi|123|ref|XP_000001.1|), as well as others.
I want to extract the accession and version numbers from the NCBI identifiers and create a new column as part of my data frame. Non-NCBI identifiers would have an NA in this column.
For example, I would like to change the following data frame:

    df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|', 'gi|1234567|ref|XP_001267.1|', 'ENSP00000124') )

into this:

                             gene   accession
    1                ENSP00000123        <NA>
    2 gi|1234567|ref|XP_001234.1| XP_001234.1
    3 gi|1234567|ref|XP_001267.1| XP_001267.1
    4                ENSP00000124        <NA>
I have tried using regmatches, but it does not work the way I want it to:

    df1$accession <- regmatches(df1$gene, regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene))

    # results in:
                             gene   accession
    1                ENSP00000123 XP_001234.1
    2 gi|1234567|ref|XP_001234.1| XP_001267.1
    3 gi|1234567|ref|XP_001267.1| XP_001234.1
    4                ENSP00000124 XP_001267.1
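My guess at what is happening: regmatches() appears to return only the two strings that actually matched, and assigning that length-2 vector to a four-row column recycles it. Below is a minimal sketch of one possible workaround, not necessarily the best one. It assumes regexpr() reports non-matching elements as -1, reuses my pattern unchanged, and the variable name m is only for illustration.

```r
df1 <- data.frame(
  gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'),
  stringsAsFactors = FALSE  # keep gene as character, also on R < 4.0
)

# One match position per element; -1 where the pattern does not match
m <- regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene)

# Start with NA in every row, then fill only the rows that matched
df1$accession <- NA_character_
df1$accession[m != -1] <- regmatches(df1$gene, m)

print(df1)
```

The idea is to keep the match positions in m so that the matched strings can be written back into the rows they came from, leaving the other rows as NA.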
Any help is greatly appreciated. Thanks in advance.