Grep the first match for each line of a pattern file
5.5 years ago

Hi,

I have a little problem with the 'grep' tool. I have two files:

• pattern_file:

id_gene_1

id_gene_2

id_gene_3

• description_file:

id_gene_1 description_xxx

id_gene_2 description_yyy

id_gene_1 description_xxx

id_gene_3 description_zzz

id_gene_3 description_zzz

id_gene_2 description_yyy

For each line of 'pattern_file', I would like to find the first match in 'description_file'. I tried the grep options -f and -m together, but I only get the very first match overall instead of the first match for each pattern.

Any ideas?


It's good practice to give an example of your expected output; it makes it much easier to get the answer you want. I would also recommend having a look at Stack Overflow, since I'm pretty sure this kind of question has been asked before.


I'm confused. You said you want the first match, but you only get the first match?


To be more precise, the 'description_file' looks like this:

description_file:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_1 description_1-2
id_gene_3 description_3-1
id_gene_3 description_3-2
id_gene_2 description_2-2


So, as output, I would like to have:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_3 description_3-1


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

In this case you should have edited your original post and added the information there.

5.5 years ago
iraun ★ 4.5k

This should work:

grep -f pattern_file description_file | awk -F" " '!_[$1]++'


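The grep command keeps every line of description_file that matches any pattern in pattern_file. The awk command then prints a line only the first time its first field is seen, so only the first match for each ID survives. Please change the -F value if the field separator in the description file is not whitespace.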
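To see the pipeline at work, here is a minimal sketch that recreates the sample files from the question in a scratch directory and runs it. The `workdir` and `first_matches` names are only for illustration, and the awk array is called `seen` instead of `_` for readability; the default awk field separator (whitespace) is assumed.

```shell
# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > "$workdir/description_file"

# grep keeps every line matching any pattern; awk then prints a line
# only the first time its first field ($1) is seen.
first_matches=$(grep -f "$workdir/pattern_file" "$workdir/description_file" \
  | awk '!seen[$1]++')

printf '%s\n' "$first_matches"
# id_gene_1 description_1-1
# id_gene_2 description_2-1
# id_gene_3 description_3-1
```

Note that the result follows the line order of description_file, not the order of pattern_file. Without grep's -w option, a pattern such as id_gene_1 would also let lines for id_gene_10 through.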

5.5 years ago

I would use sort with the 'stable' and 'unique' options, followed by a join (here, using a space as the delimiter):

join -t ' ' -1 1 -2 1 <(sort pattern_file) <(sort -t ' ' -k1,1 --stable -u description_file)
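The same idea can be sketched step by step with intermediate files instead of process substitution, which also makes it run in a plain POSIX shell. This assumes GNU sort, where -u keeps the first line of each run of equal keys; the short option -s stands in for --stable, and `LC_ALL=C` is set so that sort and join agree on the collation order. The file names under `workdir` are hypothetical.

```shell
# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > "$workdir/description_file"

# join needs both inputs sorted on the join field.
LC_ALL=C sort "$workdir/pattern_file" > "$workdir/patterns.sorted"

# Sort on the first field only; -s keeps input order within equal keys,
# -u then keeps one line per key (the first one with GNU sort).
LC_ALL=C sort -t ' ' -k1,1 -s -u "$workdir/description_file" \
  > "$workdir/descriptions.first"

# Keep only the IDs that appear in the pattern file.
joined=$(LC_ALL=C join -t ' ' -1 1 -2 1 \
  "$workdir/patterns.sorted" "$workdir/descriptions.first")

printf '%s\n' "$joined"
```

Unlike a plain grep, join compares the whole first field, so id_gene_1 cannot match id_gene_10. The output comes back sorted by ID rather than in the original order of either file.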

2.0 years ago
michael • 0

You were pretty close. This should do the trick.

xargs -I @ grep -w -m 1 @ description_file < pattern_file


You have to use xargs because grep -m 1 on its own, combined with -f, stops after the first matching line overall, whichever pattern it matched. Let's break down the command.

We feed pattern_file to xargs on standard input with < pattern_file. The way I generally read an xargs statement is "for each line, do X". In this case, X is grep -w -m 1 @ description_file. The -I @ bit tells xargs that wherever I use the character @, it should insert the line (in this case the current pattern). As an example, if the current line being read from pattern_file was id_gene_2, then what xargs would execute is grep -w -m 1 id_gene_2 description_file.

Lastly, the -w option tells grep "Select only those lines containing matches that form whole words." This is important because if your pattern is id_gene_1, without -w, grep would also match this pattern to id_gene_10, id_gene_11, id_gene_12 and so on, as the pattern is present in them too.
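The effect of -w is easiest to see on a small example. The sketch below uses made-up sample data (not the files from the question) in which an id_gene_10 line comes before the first id_gene_1 line, and runs the xargs command once with -w and once without. The variable names `with_w` and `without_w` are only for illustration.

```shell
# Hypothetical sample data: id_gene_10 appears before id_gene_1.
workdir=$(mktemp -d)
printf '%s\n' id_gene_1 id_gene_10 > "$workdir/pattern_file"
printf '%s\n' \
  'id_gene_10 description_10-1' \
  'id_gene_1 description_1-1' \
  'id_gene_1 description_1-2' \
  'id_gene_10 description_10-2' > "$workdir/description_file"

# One grep per pattern line; -m 1 stops each grep at its first match.
with_w=$(xargs -I @ grep -w -m 1 @ "$workdir/description_file" \
  < "$workdir/pattern_file")
without_w=$(xargs -I @ grep -m 1 @ "$workdir/description_file" \
  < "$workdir/pattern_file")

printf 'with -w:\n%s\n' "$with_w"
# with -w:
# id_gene_1 description_1-1
# id_gene_10 description_10-1

printf 'without -w:\n%s\n' "$without_w"
# without -w:
# id_gene_10 description_10-1
# id_gene_10 description_10-1
```

Because xargs starts one grep per pattern, the output follows the order of pattern_file, and description_file is read once per pattern, which may be slow for very large pattern files.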