Question: Grep the first match for each line of a pattern file
3.5 years ago by
cyril.noel100 wrote:

Hi,

I have a little problem with the 'grep' tool. I have two files:


  • pattern_file:

id_gene_1
id_gene_2
id_gene_3

  • description_file:

id_gene_1 description_xxx
id_gene_2 description_yyy
id_gene_1 description_xxx
id_gene_3 description_zzz
id_gene_3 description_zzz
id_gene_2 description_yyy


For each line of 'pattern_file', I would like to find the first match in 'description_file'. I thought of using grep's -f and -m options, but that only gives me the very first match overall, not the first match for each pattern.

Any ideas?

Thanks in advance

grep • 4.0k views
ADD COMMENTlink modified 8 days ago by michael0 • written 3.5 years ago by cyril.noel100

It's good practice to give an example of your expected output; it makes it much easier to get the answer you want. I would also recommend having a look at Stack Overflow, since I'm pretty sure this kind of question has been asked there before.

ADD REPLYlink written 3.5 years ago by iraun3.7k

I'm confused. You said you want the first match, but you only get the first match?

ADD REPLYlink written 3.5 years ago by shenwei3565.2k

To be more precise, the 'description_file' looks like this:

description_file:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_1 description_1-2
id_gene_3 description_3-1
id_gene_3 description_3-2
id_gene_2 description_2-2

So, as output, I would like to have:

id_gene_1 description_1-1
id_gene_2 description_2-1
id_gene_3 description_3-1
ADD REPLYlink modified 8 days ago by RamRS27k • written 3.5 years ago by cyril.noel100

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

In this case you should have edited your original post and added the information there.

ADD REPLYlink written 3.5 years ago by genomax85k
3.5 years ago by
iraun3.7k
Norway
iraun3.7k wrote:

This should work:

grep -f pattern_file description_file | awk -F" " '!_[$1]++'

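The grep command keeps every line of description_file that matches any pattern in pattern_file. The awk command then prints a line only the first time its first field is seen, so you get the first match for each gene id. Note that the output follows the order of description_file, not pattern_file. Please change the -F option if the field separator in your description file is not whitespace.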
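To check this against the sample data given in the comments, something like the sketch below should do. The scratch directory and the array name seen are my own choices for illustration (seen is just a more readable stand-in for _), and no -F is passed because whitespace is awk's default separator.

```shell
# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > pattern_file
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > description_file

# grep keeps every line matching any pattern in pattern_file;
# awk prints a line only the first time its first field ($1) is seen.
result=$(grep -f pattern_file description_file | awk '!seen[$1]++')
printf '%s\n' "$result"
```

On this data the three lines printed are the 1-1, 2-1 and 3-1 descriptions, which is the output asked for in the question.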

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by iraun3.7k
3.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

I would use sort with the 'stable' and 'unique' options, followed by a join (here, using a space as the delimiter):

join -t ' ' -1 1 -2 1 <(sort pattern_file) <(sort -t ' ' -k1,1 --stable -u description_file)
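Here is the same idea spelled out step by step on the sample data from the comments, as a rough sketch. It assumes GNU coreutils; the intermediate file names first_hits and sorted_patterns are made up for illustration, and temporary files are used instead of process substitution so that it also runs in a plain POSIX shell.

```shell
# Use one fixed collation so that sort and join agree on ordering.
export LC_ALL=C

# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' id_gene_1 id_gene_2 id_gene_3 > pattern_file
printf '%s\n' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' \
  'id_gene_3 description_3-1' \
  'id_gene_3 description_3-2' \
  'id_gene_2 description_2-2' > description_file

# Stable + unique on field 1: keep the first line seen for each gene id.
sort -t ' ' -k1,1 -s -u description_file > first_hits

# join expects both inputs to be sorted on the join field.
sort pattern_file > sorted_patterns
joined=$(join -t ' ' -1 1 -2 1 sorted_patterns first_hits)
printf '%s\n' "$joined"
```

The output comes out sorted by gene id, so the original order of pattern_file is not preserved.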
ADD COMMENTlink written 3.5 years ago by Pierre Lindenbaum129k
8 days ago by
michael0
michael0 wrote:

You were pretty close. This should do the trick.

xargs -I @ grep -w -m 1 @ description_file < pattern_file

You have to use xargs because grep -m 1 on its own stops after the first matching line overall, rather than after the first match for each pattern. Let's break down the command.

We feed pattern_file to xargs on standard input with < pattern_file. The way I generally read an xargs statement is "for each line, do X". In this case, X is grep -w -m 1 @ description_file. The -I @ bit tells xargs that wherever the character @ appears, it should insert the line (in this case the current pattern). As an example, if the current line being read from pattern_file was id_gene_2, then what xargs would execute is grep -w -m 1 id_gene_2 description_file.

Lastly, the -w option tells grep "Select only those lines containing matches that form whole words." This is important because if your pattern is id_gene_1, without -w, grep would also match this pattern to id_gene_10 or 11 or 12 etc., as the pattern is present in them too.
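If the xargs form is hard to read, it behaves roughly like the while loop in the sketch below, which also shows what -w changes. The data here is a made-up variation on the thread's example (I added an id_gene_10 line first on purpose), and the variable names are only for illustration.

```shell
# Scratch copy of the files, with an id_gene_10 line placed first on purpose.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' id_gene_1 id_gene_2 > pattern_file
printf '%s\n' \
  'id_gene_10 description_10-1' \
  'id_gene_1 description_1-1' \
  'id_gene_2 description_2-1' \
  'id_gene_1 description_1-2' > description_file

# The xargs command from the answer: one grep call per pattern line.
via_xargs=$(xargs -I @ grep -w -m 1 @ description_file < pattern_file)

# Roughly the same thing written as an explicit loop.
with_w=$(while IFS= read -r pattern; do
  grep -w -m 1 -- "$pattern" description_file
done < pattern_file)

# Same loop without -w: id_gene_1 now also matches inside id_gene_10.
without_w=$(while IFS= read -r pattern; do
  grep -m 1 -- "$pattern" description_file
done < pattern_file)

printf 'with -w:\n%s\n\nwithout -w:\n%s\n' "$with_w" "$without_w"
```

With -w, the first hit for id_gene_1 is the description_1-1 line; without it, grep stops at the id_gene_10 line instead.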

ADD COMMENTlink written 8 days ago by michael0