Question: arranging columns and rows
2.2 years ago by
Ambika30
United States
Ambika30 wrote:

Hello everyone,

I have File 1 like this with 2 columns:

g4989   2.70224323450382
g4650   2.71483380183318
g11701  2.83907744860811
g11701  2.83907744860811
g3807   2.83912968405616
g17931  2.84821618321646

and File 2 like this with 4 columns

g4989
g4650  Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g11701  Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Both files are tab-delimited. File 2 contains only a selected subset of the genes in File 1. I want to add the second column from File 1 to File 2, but only for the genes in File 2, like this:

    g4989     2.70 
    g4650     2.71         Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
    g11701   2.83         Pfam    PF04082 Fungalspecifictranscriptionfactordomain
    g17931   2.84         Pfam    PF04082 Fungalspecifictranscriptionfactordomain

Could you please help me sort this out in Linux?

Thank you, Ambika

bash awk grep • 599 views
modified 2.2 years ago by Pierre Lindenbaum129k • written 2.2 years ago by Ambika30

g11701 is present twice in file1. How should this be handled?

modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum129k

Yes, it's present twice. This is just a sample; some genes might be present more often than that, because a single gene can have several Pfam domains.

modified 2.2 years ago • written 2.2 years ago by Ambika30
2.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

Use join:

join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 file1.txt ) <(sort -t $'\t' -k1,1 file2.txt )

g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g11701  2.83907744860811    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g17931  2.84821618321646    Pfam    PF04082 Fungalspecifictranscriptionfactordomain
g4650   2.71483380183318    Pfam    PF00172 FungalZn(2)-Cys(6)binuclearclusterdomain
g4989   2.70224323450382
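If the output must keep exactly the rows (and row order) of File 2, an awk lookup is one possible alternative to join. The following is a minimal sketch, not part of the original answer: the `annotate` helper, the temporary directory, and the `NA` placeholder for genes missing from File 1 are all assumptions made for illustration, and the sample files are rebuilt from the data in the question. It also assumes that a gene repeated in File 1 always carries the same value, so the first occurrence is used.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rebuild the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'g4989\t2.70224323450382\ng4650\t2.71483380183318\ng11701\t2.83907744860811\ng11701\t2.83907744860811\ng3807\t2.83912968405616\ng17931\t2.84821618321646\n' > "$workdir/file1.txt"
printf 'g4989\ng4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\ng11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\ng17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' > "$workdir/file2.txt"

# annotate FILE1 FILE2 (hypothetical helper)
# Pass 1 (FILE1): remember the value for each gene; first occurrence wins.
# Pass 2 (FILE2): print each row with that value inserted after column 1.
annotate() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { if (!($1 in val)) val[$1] = $2; next }
    {
      rest = ""
      for (i = 2; i <= NF; i++) rest = rest OFS $i
      # "NA" is an assumed placeholder for genes absent from FILE1
      print $1, (($1 in val) ? val[$1] : "NA") rest
    }
  ' "$1" "$2"
}

annotate "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/annotated.txt"
cat "$workdir/annotated.txt"
```

Because awk reads File 2 line by line, the result has one output row per File 2 row regardless of how often a gene appears in File 1, and no sorting is needed.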
written 2.2 years ago by Pierre Lindenbaum129k

Hi Pierre, thank you. I got the output file, but its number of rows is not the same as in File 2.

written 2.2 years ago by Ambika30

This happens if, as in your example, there is a duplicated key, e.g. g11701. See also the -v and -a options of join.
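As a rough illustration of those options, the sketch below collapses File 1 to one line per gene before joining, then uses `-a 2` to keep File 2 rows that have no partner and `-v 2` to list only those unmatched rows. It rests on a few assumptions: repeated genes in File 1 carry identical values (so keeping one line per key loses nothing), the sample files are rebuilt from the question, and the temporary file names are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # sort and join must agree on the collation order

# Rebuild the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf 'g4989\t2.70224323450382\ng4650\t2.71483380183318\ng11701\t2.83907744860811\ng11701\t2.83907744860811\ng3807\t2.83912968405616\ng17931\t2.84821618321646\n' > "$workdir/file1.txt"
printf 'g4989\ng4650\tPfam\tPF00172\tFungalZn(2)-Cys(6)binuclearclusterdomain\ng11701\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\ng17931\tPfam\tPF04082\tFungalspecifictranscriptionfactordomain\n' > "$workdir/file2.txt"

# File 1: sort by gene and keep a single line per gene (-u with -k1,1).
sort -t $'\t' -k1,1 -u "$workdir/file1.txt" > "$workdir/file1.sorted"
# File 2: sort by gene but keep every row.
sort -t $'\t' -k1,1 "$workdir/file2.txt" > "$workdir/file2.sorted"

# -a 2: also print File 2 rows that have no match in File 1.
join -t $'\t' -a 2 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/joined.txt"

# -v 2: print ONLY the File 2 rows that have no match in File 1.
join -t $'\t' -v 2 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/unmatched.txt"

cat "$workdir/joined.txt"
```

With one line per gene in File 1, each File 2 row pairs at most once, so `joined.txt` has as many rows as File 2; the rows come out in sorted key order rather than in the original File 2 order.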

written 2.2 years ago by Pierre Lindenbaum129k

Thank you so much for your help.

written 2.2 years ago by Ambika30

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.


written 2.2 years ago by Pierre Lindenbaum129k