Question

arranging columns and rows

6.1 years ago

AP ▴ 80

Hello everyone,

I have File 1 like this with 2 columns:

g4989   2.70224323450382
g4650   2.71483380183318
g11701  2.83907744860811
g11701  2.83907744860811
g3807   2.83912968405616
g17931  2.84821618321646

and File 2 like this with 4 columns

g4989
g4650  Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g11701  Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Both files are tab-delimited. File 2 contains only a selected subset of the genes in File 1. I want to add the second column of File 1 to File 2, but only for the genes present in File 2, like this:

    g4989     2.70 
    g4650     2.71         Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
    g11701   2.83         Pfam    PF04082 Fungalspecifictranscriptionfactordomain
    g17931   2.84         Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Could you please help me sort this out on Linux?

Thank you, Ambika

awk grep bash • 1.2k views

ADD COMMENT • link updated 6.1 years ago by Pierre Lindenbaum 161k • written 6.1 years ago by AP ▴ 80

g11701 is present twice in File 1. How should this be handled?

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k

Yes, it is present twice. This is just a sample; some genes might be present more often than that, because a single gene might have several Pfam domains.

ADD REPLY • link 6.1 years ago by AP ▴ 80

score 2 · Answer 1 · 2018-04-16

6.1 years ago

Pierre Lindenbaum 161k

Use join:

join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 file1.txt ) <(sort -t $'\t' -k1,1 file2.txt )

g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  2.84821618321646    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g4650   2.71483380183318    Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g4989   2.70224323450382
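If the rows should stay in File 2's original order, with exactly one output row per File 2 row, an awk lookup is one possible alternative to join. The following is a minimal sketch, not the method used above. It assumes that a gene duplicated in File 1 always carries the same value, so keeping the first value seen is acceptable, and it recreates the sample data from the question in a temporary directory (the `workdir` and `merged` names are illustrative only). Values are kept at full precision.

```bash
# Recreate the tab-delimited sample files from the question.
workdir=$(mktemp -d)
printf 'g4989\t2.70224323450382\ng4650\t2.71483380183318\ng11701\t2.83907744860811\ng11701\t2.83907744860811\ng3807\t2.83912968405616\ng17931\t2.84821618321646\n' > "$workdir/file1.txt"
printf 'g4989\ng4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\ng11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\ng17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' > "$workdir/file2.txt"

# First file (NR == FNR): remember the first value seen for each gene.
# Second file: insert that value after the gene ID and print the line.
merged=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { if (!($1 in val)) val[$1] = $2; next }
    { key = $1; $1 = key OFS val[key]; print }
' "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n' "$merged"
rm -rf "$workdir"
```

Because File 1 is only used as a lookup table, duplicated genes in File 1 cannot multiply the output rows, and no sorting is needed.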

Hi Pierre, thank you. I got an output file, but its number of rows is not the same as in File 2.

ADD REPLY • link 6.1 years ago by AP ▴ 80

This happens when, as in your example, a key is duplicated (e.g. g11701). See also the -v and -a options of join.
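One way to make the row count match File 2 is to collapse duplicated keys in File 1 before joining. The sketch below is an illustration under assumptions, not a definitive recipe: it assumes duplicated File 1 lines carry the same value, uses the sample data from the question, and writes to a temporary directory (`workdir`, `joined` and `unmatched` are illustrative names). It also shows what `-a 2` and `-v 2` report.

```bash
export LC_ALL=C                 # same collation for sort and join
tab=$(printf '\t')
workdir=$(mktemp -d)
printf 'g4989\t2.70224323450382\ng4650\t2.71483380183318\ng11701\t2.83907744860811\ng11701\t2.83907744860811\ng3807\t2.83912968405616\ng17931\t2.84821618321646\n' > "$workdir/file1.txt"
printf 'g4989\ng4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\ng11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\ng17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' > "$workdir/file2.txt"

# -u together with -k1,1 keeps a single line per gene ID in File 1.
sort -t "$tab" -k1,1 -u "$workdir/file1.txt" > "$workdir/file1.sorted"
sort -t "$tab" -k1,1 "$workdir/file2.txt" > "$workdir/file2.sorted"

# -a 2: also print File 2 lines that have no partner in File 1.
joined=$(join -t "$tab" -a 2 "$workdir/file1.sorted" "$workdir/file2.sorted")

# -v 2: print only the File 2 lines without a partner (none in this sample).
unmatched=$(join -t "$tab" -v 2 "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

File 2 is deliberately not deduplicated, since a gene may legitimately appear there once per Pfam domain; each of those rows then pairs with the single remaining File 1 line.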

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k

Thank you so much for your help.

ADD REPLY • link 6.1 years ago by AP ▴ 80

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k