Question

Obtaining the first two words from a character column of a data frame

2.1 years ago

Ne ▴ 10

I would like to extract the first two words from each string in a character column. For example,

y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))

I want to keep at most two words from each string:

name

'london hilss'

'newyork hills'

'paris'

R • 672 views

updated 2.1 years ago by cpad0112 • written 2.1 years ago by Ne

score 2 · Answer 1 · 2022-03-03

2.1 years ago

Malcolm.Cook ★ 1.5k

gsub('^(\\S*\\s*\\S*).*$','\\1',y$name)
[1] "london hilss"  "newyork hills" "paris"

regular expressions FTW!

edit: used \\S to capture "words" instead of \\w, allowing all non-whitespace characters to be part of "words"
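
As a rough sketch of how the same idea could be generalised to the first n words, the pattern can be built with sprintf. The helper name first_n_words is hypothetical (it is not part of the answer above), and perl = TRUE is assumed so that the non-capturing group (?:...) is supported:

```r
# Hypothetical helper: keep at most n whitespace-separated words per string
first_n_words <- function(x, n = 2) {
  # \\S+ is one word; (?:\\s+\\S+){0,n-1} allows up to n - 1 further words
  pattern <- sprintf("^\\s*(\\S+(?:\\s+\\S+){0,%d}).*$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))
first_n_words(y$name, 2)
```

Unlike the original pattern, this variant also skips leading whitespace and requires at least one non-whitespace character, so an empty string is returned unchanged.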

score 1 · Answer 2 · 2022-03-02

Split the character vector on spaces, then take the first two pieces.

> y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))
> library(stringr)
> library(tidyr)
> str_to_sentence(unite(data.frame(str_split(y$name, " ", 3, simplify = TRUE)[, 1:2]), "new", sep = " ")$new)
[1] "London hilss"  "Newyork hills" "Paris "

score 1 · Answer 3 · 2022-03-02

Edit: I think I prefer Malcolm's response! It is much shorter and simpler, although maybe less readable.

You can split the strings as cpad suggested; that's the simplest approach.

Like this:

> firstN <- function(x, n) {
     words <- strsplit(x, " ")[[1]]
     # Take at most n words; seq_len() also copes with an empty string
     paste(words[seq_len(min(n, length(words)))], collapse = " ")
 }

> sapply(y$name, FUN = function(x) firstN(x, 2), USE.NAMES = FALSE)
[1] "london hilss"  "newyork hills" "paris"

I had to make firstN because if you ask for c("Test")[1:2], for example, you'll get an NA.
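
To make the NA behaviour concrete, here is a small sketch of the indexing pitfall and two ways to cap the index (the variable names are only illustrative):

```r
words <- c("Test")

# Indexing past the end of a vector pads the result with NA
words[1:2]

# Capping the index at the vector length avoids the padding
n <- 2
words[seq_len(min(n, length(words)))]

# head() applies the same cap internally
head(words, n)
```

Using seq_len() rather than 1:min(...) matters when the vector is empty, because 1:0 counts down to c(1, 0) instead of producing an empty index.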

Alternatively you can use the word function from stringr.

On its own, word works for strings that have at least two words:

> library(stringr)
> y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))
> word(y$name, 1, 2)
[1] "london hilss"  "newyork hills" NA

Unfortunately, it returns NA for a string that has only one word.

You can hack together something that fixes that, though, like this:

words_or_fewer <- function(str, n) {
    answer <- word(str, start = 1, end = n)

    # While the answer is NA, retry with one word fewer
    while (is.na(answer) && n > 1) {
        n <- n - 1
        answer <- word(str, start = 1, end = n)
    }
    answer
}

# Just a wrapper to use words_or_fewer with vectors
words_or_fewer_vec <- function(str_vec, n) {
    sapply(str_vec, FUN = function(str) words_or_fewer(str, n), USE.NAMES = FALSE)
}

> words_or_fewer_vec(y$name, 2)
[1] "london hilss"  "newyork hills" "paris"