Question: arranging columns and rows
0
9 days ago by
Ambika20
United States/Auburn/Auburn University
Ambika20 wrote:

Hello everyone,

I have File 1 like this with 2 columns:

g4989   2.70224323450382
g4650   2.71483380183318
g11701  2.83907744860811
g11701  2.83907744860811
g3807   2.83912968405616
g17931  2.84821618321646

and File 2 like this with 4 columns:

g4989
g4650  Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g11701  Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Both files are tab-delimited. File 2 contains only a selected subset of the genes in File 1. I want to add the second column from File 1 to File 2, but only for the genes in File 2, like this:

    g4989     2.70 
    g4650     2.71         Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
    g11701   2.83         Pfam    PF04082 Fungalspecifictranscriptionfactordomain
    g17931   2.84         Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Could you please help me sort this out in Linux?

Thank you, Ambika

bash awk grep • 91 views
ADD COMMENTlink modified 9 days ago by Pierre Lindenbaum106k • written 9 days ago by Ambika20

g11701 is present twice in file1. How should this be handled?

ADD REPLYlink modified 9 days ago • written 9 days ago by Pierre Lindenbaum106k

Yes, it's present twice, and this is just a sample. Some genes might be present more often than that, because a single gene might have several different Pfam domains.
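
A quick way to see which gene IDs are repeated in the first column is sketched below. It assumes File 1 is saved as file1.txt (the name used later in the thread) and recreates the sample data in a temporary directory so it can be run as-is.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Recreate the sample File 1 from the question (tab-delimited).
printf 'g4989\t2.70224323450382\n'   > file1.txt
printf 'g4650\t2.71483380183318\n'  >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g3807\t2.83912968405616\n'  >> file1.txt
printf 'g17931\t2.84821618321646\n' >> file1.txt

# Take column 1, sort it, and print only the IDs that occur more than once.
dups=$(cut -f1 file1.txt | sort | uniq -d)
printf '%s\n' "$dups"
```

On the sample data this lists g11701 only.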

ADD REPLYlink modified 9 days ago • written 9 days ago by Ambika20
2
9 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum106k wrote:

Use join:

join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 file1.txt ) <(sort -t $'\t' -k1,1 file2.txt )

g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  2.84821618321646    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g4650   2.71483380183318    Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g4989   2.70224323450382
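
The join output above is ordered by the sorted key rather than by the original order of File 2. If the File 2 order matters, one alternative is an awk lookup; this is a sketch and not part of the join answer. It assumes the file names file1.txt and file2.txt, keeps only the first value seen for a repeated gene ID (an assumption about how duplicates should be handled), and leaves the value empty for a gene missing from File 1.

```shell
# Work in a scratch directory and recreate the sample files (tab-delimited).
workdir=$(mktemp -d) && cd "$workdir"
printf 'g4989\t2.70224323450382\n'   > file1.txt
printf 'g4650\t2.71483380183318\n'  >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g3807\t2.83912968405616\n'  >> file1.txt
printf 'g17931\t2.84821618321646\n' >> file1.txt
printf 'g4989\n'                                                           > file2.txt
printf 'g4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\n' >> file2.txt
printf 'g11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' >> file2.txt
printf 'g17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' >> file2.txt

merged=$(awk -F '\t' -v OFS='\t' '
    # First file: remember the first value seen for each gene ID.
    NR == FNR { if (!($1 in val)) val[$1] = $2; next }
    # Second file: print the ID, the looked-up value, then the remaining columns.
    {
        out = $1 OFS val[$1]
        for (i = 2; i <= NF; i++) out = out OFS $i
        print out
    }
' file1.txt file2.txt)
printf '%s\n' "$merged"
```

Because every line of File 2 is printed exactly once, the result always has the same number of rows as File 2, in the same order.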
ADD COMMENTlink written 9 days ago by Pierre Lindenbaum106k

Hi Pierre, thank you. I got an output file, but it does not have the same number of rows as File 2.

ADD REPLYlink written 9 days ago by Ambika20

This happens when, as in your example, a key is duplicated (e.g. g11701). See also the -v and -a options of join.
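
As a rough illustration of those options, the sketch below removes exact duplicate lines from File 1 before joining, uses -a 2 to keep File 2 lines that have no partner in File 1, and uses -v 2 to list only such unpaired lines. The file names and the choice to drop exact duplicates are assumptions. Intermediate files replace the process substitution so it also runs in a plain POSIX shell.

```shell
# Work in a scratch directory and recreate the sample files (tab-delimited).
workdir=$(mktemp -d) && cd "$workdir"
printf 'g4989\t2.70224323450382\n'   > file1.txt
printf 'g4650\t2.71483380183318\n'  >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g11701\t2.83907744860811\n' >> file1.txt
printf 'g3807\t2.83912968405616\n'  >> file1.txt
printf 'g17931\t2.84821618321646\n' >> file1.txt
printf 'g4989\n'                                                           > file2.txt
printf 'g4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\n' >> file2.txt
printf 'g11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' >> file2.txt
printf 'g17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' >> file2.txt

# Use one collation order for both sort and join.
export LC_ALL=C
tab=$(printf '\t')

# Drop exact duplicate lines from File 1, then sort both files on column 1.
sort -u file1.txt | sort -t "$tab" -k1,1 > file1.sorted.txt
sort -t "$tab" -k1,1 file2.txt > file2.sorted.txt

# -a 2: also print File 2 lines that have no match in File 1.
joined=$(join -t "$tab" -a 2 file1.sorted.txt file2.sorted.txt)
printf '%s\n' "$joined"

# -v 2: print only the File 2 lines that have no match in File 1.
unmatched=$(join -t "$tab" -v 2 file1.sorted.txt file2.sorted.txt)

# Compare row counts of the joined output and File 2.
rows_out=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')
rows_in=$(wc -l < file2.txt | tr -d ' ')
echo "output rows: $rows_out, File 2 rows: $rows_in"
```

With each gene ID occurring once in the deduplicated File 1, every File 2 line produces exactly one output line, so the two counts agree on the sample data. If a gene carried two different values in File 1, sort -u would keep both and the counts would differ again.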

ADD REPLYlink written 9 days ago by Pierre Lindenbaum106k

Thank you so much for your help.

ADD REPLYlink written 9 days ago by Ambika20
1

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLYlink written 9 days ago by Pierre Lindenbaum106k