Question

Place wordcount next to each word in a paragraph of text

1

5.0 years ago

zack.saud ▴ 50

Hi, Is there any way to place the count of each word next to the word in GREP SED or AWK (or is there any software with this functionality built in)?

For example, given this text:

"I love the power of Grep SED and AWK, but I am no good at using it"

I'd like to have this:

"I(1) love(2) the(3) power(4) of(5) Grep(6) SED(7) and(8) AWK,(9) but(10) I(11) am(12) no(13) good(14) at(15) using(16) it(17)"

Thanks a million in advance

Zack

sequence • 2.2k views

ADD COMMENT • link updated 5.0 years ago by Kevin Blighe 89k • written 5.0 years ago by zack.saud ▴ 50

2

This post is in no obvious way related to bioinformatics. It is not even related to biology. Given how quickly, in most circumstances, the moderators shut down unrelated posts - and even posts related to biology but not to bioinformatics - I can only conclude that you are showing off. That will surely offend someone who has already contributed in this thread - and I am grateful for your useful code - but think about how many posters who have a much better case than this one are offended when you close their threads.

ADD REPLY • link 5.0 years ago by Mensur Dlakic ★ 29k

2

You are right that this question is probably off-topic and could have been closed for that reason (OP has posted bioinformatics questions before, which led me to assume there must be some biological context here). It is, however, a delicate balance between "things we don't want on biostars" and "things which are not the main focus but with which the community can easily help". Personally I think pure coding questions fit better here than biology, but that's disputable. Surely it is interesting to see how such a thing can be done in multiple languages - but that's also out of scope for biostars.

ADD REPLY • link 5.0 years ago by WouterDeCoster 48k

2

Agreed! It's also nice to have a bit of fun on here. Seems like we got a nice collection of people approaching this with their preferred language. And it got me to procrastinate on some work for a couple more minutes! :)

ADD REPLY • link 5.0 years ago by bioinformatics2020 ▴ 840

0

You are right, and I was inches away from closing the post outright. I wrote a solution out of my own interest which I wasn't going to post. We started to discuss on the Slack group and others provided some comparable code. It was decided at that point that we might as well post them since people had written them for their own edification.

But make no mistake, this thread is not forum-suitable in its current form. It may be some sort of legitimate biological/bioinformatic task that has been heavily abstracted; if that's the case, that is why I asked OP to elaborate.

Threads like this should and will remain the exception rather than the rule, however.

ADD REPLY • link 5.0 years ago by Joe 22k

0

Can you explain how this is a bioinformatics question and why you'd want to do this?

ADD REPLY • link 5.0 years ago by Joe 22k

score 4 · Answer 1 · 2020-07-16

Use tr (translate) to put every word on its own line, then awk to append the row number (NR) to each word, then tr again to turn the newlines back into spaces:

$ echo "I love the power of Grep SED and AWK, but I am no good at using it" | \
  tr -s ' ' '\n' | \
  awk '{print $1"("NR")"}' | tr -s '\n' ' '

I(1) love(2) the(3) power(4) of(5) Grep(6) SED(7) and(8) AWK,(9) but(10) I(11) am(12) no(13) good(14) at(15) using(16) it(17)

Or using R:

x <- "I love the power of Grep SED and AWK, but I am no good at using it"
y <- unlist(strsplit(x, " "))

paste(paste0(y, "(", seq_along(y), ")"), collapse = " ")
# [1] "I(1) love(2) the(3) power(4) of(5) Grep(6) SED(7) and(8) AWK,(9) but(10) I(11) am(12) no(13) good(14) at(15) using(16) it(17)"

score 3 · Answer 2 · 2020-07-16

3

5.0 years ago

WouterDeCoster 48k

Not sure if this is on-topic, but hey, I wrote two lines of Python:

for line in open(myfile):  # myfile: path to the input text file
  print(' '.join(f"{word}({index})" for index, word in enumerate(line.split(), start=1)))  # start=1 gives 1-based numbers, as in the question
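For trying the same idea on a string rather than a file, here is a minimal self-contained sketch. The function name `number_words` and its `start` parameter are made up for illustration; it assumes words are separated by whitespace and numbers them from 1 to match the example in the question.

```python
def number_words(text: str, start: int = 1) -> str:
    """Append each word's position in the text, e.g. 'I love' -> 'I(1) love(2)'."""
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty "words".
    words = text.split()
    return " ".join(f"{word}({pos})" for pos, word in enumerate(words, start=start))


if __name__ == "__main__":
    sentence = "I love the power of Grep SED and AWK, but I am no good at using it"
    print(number_words(sentence))
```

Passing `start=0` would give zero-based positions instead.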

ADD COMMENT • link 5.0 years ago by WouterDeCoster 48k

score 3 · Answer 3 · 2020-07-16

Well since others are offering solutions I might as well provide the code I wrote out of interest. It uses pure bash, so not awk, grep or sed as requested, but comparable in accessibility etc.

Initially I misread the question and thought you wanted word lengths, not just index positions, so I have 2 bits of code ¯\_(ツ)_/¯

Index position:

#!/bin/bash

newstring=""
while IFS=' ' read -r -a array ; do
  for i in "${!array[@]}" ; do
     newstring+="${array[i]}(${i}) "
  done
done <<< "I love the power of Grep SED and AWK, but I am no good at using it"
echo "${newstring%"${newstring##*[![:space:]]}"}"

I(0) love(1) the(2) power(3) of(4) Grep(5) SED(6) and(7) AWK,(8) but(9) I(10) am(11) no(12) good(13) at(14) using(15) it(16)

(Results in zero-based numbering, but I assume that's not a deal breaker)

Word length:

#!/bin/bash

newstring=""
while IFS=' ' read -r -a array ; do
  for i in "${array[@]}" ; do
     newstring+="${i}(${#i}) "
  done
done <<< "I love the power of Grep SED and AWK, but I am no good at using it"
echo "${newstring%"${newstring##*[![:space:]]}"}"

I(1) love(4) the(3) power(5) of(2) Grep(4) SED(3) and(3) AWK,(4) but(3) I(1) am(2) no(2) good(4) at(2) using(5) it(2)

score 2 · Answer 4 · 2020-07-16

2

5.0 years ago

bioinformatics2020 ▴ 840

Edit: Just like Joe above, I also thought the initial question was asking for the word length, not index. Here's my response in R.

By Index

library(tidyverse)
library(glue)
str_org <- "I love the power of Grep SED and AWK, but I am no good at using it" %>% str_split(" ")
str <- as.vector(glue("({1:length(str_org[[1]])})"))
str <- glue_collapse(paste(str_org[[1]],as.vector(str)), sep = " ")
print(str)

Result

I (1) love (2) the (3) power (4) of (5) Grep (6) SED (7) and (8) AWK, (9) but (10) I (11) am (12) no (13) good (14) at (15) using (16) it (17)

By Word Length

library(tidyverse)
library(glue)
str_org <- "I love the power of Grep SED and AWK, but I am no good at using it" %>% str_split(" ")
str <- as.vector(glue("({str_count(str_org[[1]])})"))
str <- glue_collapse(paste(str_org[[1]],as.vector(str)), sep = " ")
print(str)

The result

I (1) love (4) the (3) power (5) of (2) Grep (4) SED (3) and (3) AWK, (4) but (3) I (1) am (2) no (2) good (4) at (2) using (5) it (2)

By Word Length, Not Counting Periods or Commas

library(tidyverse)
library(glue)
str_org <- "I love the power of Grep SED and AWK, but I am no good at using it" %>% str_split(" ")
# str_replace_all() removes every comma and period, not just the first one in each word
word_len <- str_count(str_replace_all(str_org[[1]], "[,.]", ""))
str <- as.vector(glue("({word_len})"))
str <- glue_collapse(paste(str_org[[1]],as.vector(str)), sep = " ")
print(str)
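For comparison, the same punctuation-aware length count can be sketched in Python. This is only an illustration, not part of the R answer: the names `IGNORED` and `word_lengths` are hypothetical, only commas and periods are ignored (as in the heading above), and the label is attached without a space, following the format in the question.

```python
# Translation table that deletes commas and periods before measuring a word.
IGNORED = str.maketrans("", "", ",.")


def word_lengths(text: str) -> str:
    """Append each word's length, not counting commas or periods."""
    labelled = []
    for word in text.split():
        # Measure the word without punctuation, but print it unchanged.
        length = len(word.translate(IGNORED))
        labelled.append(f"{word}({length})")
    return " ".join(labelled)


if __name__ == "__main__":
    sentence = "I love the power of Grep SED and AWK, but I am no good at using it"
    print(word_lengths(sentence))
```

With this sketch, "AWK," is labelled 3 rather than 4, because the trailing comma is not counted.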

ADD COMMENT • link 5.0 years ago by bioinformatics2020 ▴ 840

1

Nice use of the headers via -------, by the way

Header

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

You can also use ### strings of several lengths for different size headers, as per normal markdown

ADD REPLY • link 5.0 years ago by Joe 22k

0

Not quite, as it's not the length of the words but the order number in the sentence (per the example)

ADD REPLY • link 5.0 years ago by WouterDeCoster 48k

1

Yeah, just like Joe above, I also misread it. Edited it for index as well. ;)

ADD REPLY • link 5.0 years ago by bioinformatics2020 ▴ 840

score 2 · Answer 5 · 2020-07-16

2

5.0 years ago

Kevin Blighe 89k

echo "I love the power of Grep SED and AWK, but I am no good at using it" | \
  awk -F " " '{for (i=1; i<=NF; i++) printf "%s(%d)%s", $i, i, (i<NF ? " " : "\n")}'  # explicit format string; newline after the last field

I(1) love(2) the(3) power(4) of(5) Grep(6) SED(7) and(8) AWK,(9) but(10) I(11) am(12) no(13) good(14) at(15) using(16) it(17)

ADD COMMENT • link 5.0 years ago by Kevin Blighe 89k