How to create a faster way to match a list of strings to a list of patterns in R?
5.8 years ago
sckinta ▴ 730

As the title says, I have a list of strings (> 10,000). For each string in the list, I want to know whether it matches any pattern in my pattern list (> 10,000). This is what I came up with:

library(stringr)  # provides str_detect()

# Return TRUE if str matches at least one pattern in pattern_list
any_match <- function(str) {
  any(sapply(pattern_list, function(x) str_detect(str, x)))
}

# Apply any_match to each element of the string list
sapply(str_list, any_match)

It works, but it is super slow. Is there a quicker way to do it?

R string • 12k views

Please give an example of both lists and an example of what you would consider a valid match. Using two sapply in such a short command is always suspicious, I am sure we can find a faster solution.


I can't load string_list.txt in R; it complains about duplicates.


I tried to use separate(sample, c('description','id','Source_Name'), sep="\\.") to get the last part of the pattern. However, since the sample strings (string_list) are messy, it causes warnings. I resolved them manually, but I am looking for an automatic way to do this.


OK, the string_list.txt that you provided is super messy. Is there a formatting mistake?

5.8 years ago
Michael 54k

Is this trying to select some sample information? Is the pattern a regular expression? If not:

See ?pmatch or as simple as:

  pmatch(pattern.list, my.string.vector);

This might do it, but I am not sure which output format you want; grep or grepl may give you more control.

If the patterns are regular expressions, I'd start with ?grep, which is vectorized over the strings being searched (not over the pattern).

Something along the lines:

  sapply(pattern.list, grep, my.string.vector)  # or
  sapply(pattern.list, function(p) {
    any(grepl(p, my.string.vector))
  })

I haven't tested this, so expect some tweaking to fit your data structures. Try to read your data simply using scan(); there is no need to make lists here. It should be around 10-100 times faster than the double sapply.
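
To answer the original question (one TRUE/FALSE per string rather than per pattern), the same idea can be turned around: loop over the patterns only, let grepl() handle all strings at once, and OR the results together. A minimal sketch, assuming made-up toy data (the real files are not shown in this thread) and assuming the patterns are literal substrings, hence fixed = TRUE:

```r
# Hypothetical toy data standing in for the real string and pattern lists
str_list     <- c("sampleA.liver.001", "sampleB.kidney.002", "control.003")
pattern_list <- c("liver", "kidney", "heart")

# One grepl() call per pattern; each call scans every string at once.
# fixed = TRUE treats the pattern as a literal string (an assumption here);
# drop it if the patterns really are regular expressions.
per_pattern <- lapply(pattern_list, grepl, x = str_list, fixed = TRUE)

# Combine the per-pattern logical vectors: TRUE if any pattern matched
hit <- Reduce(`|`, per_pattern)

print(hit)
```

If the patterns are regular expressions, pasting them into one alternation with paste(pattern_list, collapse = "|") and calling grepl() once may also work, but with more than 10,000 patterns the combined expression could become too large for the regex engine, so that would need testing on the real data.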


I think grep is better than pmatch. When there are multiple partial matches, pmatch returns NA (it is charmatch that returns 0 in that case). In addition, pmatch prefers a full match over a partial match. Because it only matches from the start of the string, pmatch sometimes fails to give the desired output where grep succeeds.
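
A small illustration of these differences, using a made-up two-element vector rather than the data from the question. As far as I know, base R gives NA from pmatch for an ambiguous partial match, while charmatch gives 0:

```r
# Hypothetical example vector, not taken from the question's data
x <- c("mean", "median")

ambiguous_pmatch    <- pmatch("me", x)     # "me" is a prefix of both elements
ambiguous_charmatch <- charmatch("me", x)  # same input, different convention
unique_prefix       <- pmatch("mea", x)    # only "mean" starts with "mea"
inner_pmatch        <- pmatch("dian", x)   # "dian" is not a prefix of anything
inner_grepl         <- grepl("dian", x, fixed = TRUE)  # substring search

print(ambiguous_pmatch)     # NA
print(ambiguous_charmatch)  # 0
print(unique_prefix)        # 1
print(inner_pmatch)         # NA
print(inner_grepl)          # FALSE TRUE
```

So for "does this pattern occur anywhere in the string", grepl is the safer choice; pmatch only helps when the pattern is an unambiguous prefix.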
