Alternative for grep in a for loop

Hello Stars,

I have two files list1.txt and list2.txt which look like this:

cat list1.txt

AT4G38910 3:17541308-17542307
AT4G38910 3:17639717-17640716
AT4G24540 1:25400514-25401513
AT4G24540 1:3398359-3399358
AT1G27730 1:4463470-4463858
AT1G27730 1:10073550-10074358

cat list1.txt | wc -l
650000

and

cat list2.txt
MYB94 AT3G47600 3:17541308-17542307
VPS29 AT3G47810 3:17639717-17640716
GSTU17 AT1G10370 1:3398359-3399358
CYP71B29 AT1G13100 1:4463470-4463858
AT1G28660 AT1G28660 1:10073550-10074358
BPC5 AT4G38910 4:18147081-18147080
AGL24 AT4G24540 4:12674107-12675106
ZAT10 AT1G27730 1:9649324-9650323

cat list2.txt | wc -l
5000

I am trying to look up the list1.txt entries, column by column, in list2.txt to get an output like:

BPC5 AT4G38910 MYB94 AT3G47600 3:17541308-17542307
BPC5 AT4G38910 VPS29 AT3G47810 3:17639717-17640716
AGL24 AT4G24540 AT1G67750 AT1G67750 1:25400514-25401513
AGL24 AT4G24540 GSTU17 AT1G10370 1:3398359-3399358
ZAT10 AT1G27730 CYP71B29 AT1G13100 1:4463470-4463858
ZAT10 AT1G27730 AT1G28660 AT1G28660 1:10073550-10074358

To do this, I am running:

for i in `cat list1.txt | awk '{print $1}'`; do grep $i list2.txt ; done | awk '{print $1,$2}' > l1.txt
for i in `cat list1.txt | awk '{print $2}'`; do grep $i list2.txt ; done > l2.txt
paste l1.txt l2.txt > results.txt

But I am aware that grep in a for loop is the wrong tool for this, and it is taking a very long time to generate the output. I am looking for an alternative (maybe awk?), or for a way to parallelize this with xargs or GNU parallel. Any help is highly appreciated.
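For completeness, the parallel variant I had in mind looks roughly like this (untested sketch; -k keeps GNU parallel's output in input order so the final paste still lines up, and I added -wF so IDs only match as whole words, but it still launches one grep per list1.txt entry, so I doubt it really fixes the speed problem):

# same three steps, with each grep dispatched through GNU parallel;
# -k (--keep-order) preserves the input order of the results
awk '{print $1}' list1.txt | parallel -k -j4 grep -wF -- {} list2.txt | awk '{print $1,$2}' > l1.txt
awk '{print $2}' list1.txt | parallel -k -j4 grep -wF -- {} list2.txt > l2.txt
paste l1.txt l2.txt > results.txt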


You don't want grep; you want join: https://linux.die.net/man/1/join
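For the first of your two lookups (column 1 of list1.txt against column 2 of list2.txt) it would be roughly the following; join needs both inputs sorted on the join field, and the coordinate lookup would be a second join of the same shape:

# output columns come out as: key, rest of the list1 line, rest of the list2 line;
# reorder with awk afterwards if you need the exact l1.txt layout
join -1 1 -2 2 <(sort -k1,1 list1.txt) <(sort -k2,2 list2.txt)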


Thanks for the reply. But in my case both columns of list1.txt contain repeated entries, and the column1-column2 pairs in list1.txt (which are meant to be network edges) do not line up row-for-row with the entries in list2.txt (which is essentially an alias file for the network). What I want is:

  1. for each entry in column 1 of list1.txt, if it matches column 2 of list2.txt, print columns 1 and 2 of list2.txt > output1
  2. for each entry in column 2 of list1.txt, if it matches column 3 of list2.txt, print all columns of list2.txt > output2
  3. merge output1 and output2 to produce a 5-column output3

I am not sure join can do that, given the repetitions in list1.txt. Again, thanks for the help.
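For concreteness, something along these lines (a rough, untested awk sketch) is the kind of single-pass lookup I am after; it assumes the gene IDs and coordinates are each unique within list2.txt, otherwise only the last occurrence would be kept:

# pass 1 (list2.txt): index by gene ID (column 2) and by coordinate (column 3)
# pass 2 (list1.txt): look up both columns of each edge and print the 5-column line
awk '
NR == FNR { byid[$2] = $1 " " $2; bycoord[$3] = $0; next }
($1 in byid) && ($2 in bycoord) { print byid[$1], bycoord[$2] }
' list2.txt list1.txt > results.txt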


join handles repeats:

$ join -t. -1 1 -2 1 <(echo -e "A.1\nA.2\nA.3\nB.4") <(echo -e "A.X\nA.Y\nA.Z\nB.X")
A.1.X
A.1.Y
A.1.Z
A.2.X
A.2.Y
A.2.Z
A.3.X
A.3.Y
A.3.Z
B.4.X



Side note: backticks are a legacy way of performing command substitution. It's time to move on to the $() form: it is a lot more elegant and, what's more, it can be nested. Also, read up on UUOC (useless use of cat). You could literally just use cut -f1 list1.txt instead of cat list1.txt | awk '{print $1}'.
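For example, the first loop could be written like this (assuming the files are space-delimited, hence the -d' '; drop it if they are tab-separated):

# $() replaces the backticks and nests cleanly; cut replaces the cat | awk pipeline
for i in $(cut -d' ' -f1 list1.txt); do grep -- "$i" list2.txt; done | awk '{print $1,$2}' > l1.txt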


Why don't you do a left join in R or Python?


Why reinvent the wheel?


Unix join is good enough for simple use cases. R/Python are better for more complicated cases or for re-runnable pipelines.
