Join two files by matching columns
0
0
Entering edit mode
13 months ago

Hi

I am trying to join two files, one has 4 columns the other 3. I want to match the third column of the first to the second column on the second file and add the 3rd column of the second file to the first, when the data from each column match. I managed to do it but only when I output all columns from second file. Like this:

join -t' ' -1 3 -2 2 -o 1.1,1.2,1.3,1.4,2.1 join_2.txt join_1.txt > join_try_3.txt


If I run this, it doesn't work:

join -t' ' -1 3 -2 2 -o 1.1,1.2,1.3,1.4,2.3 join_2.txt join_1.txt > join_try_3.txt


Any ideas why? And suggestions to only add the the 3rd column of the second file would be appreciated.

join linux • 703 views
0
Entering edit mode

how is it related to bioinformatics ? (and BTW how did you sort join_* ?)

0
Entering edit mode

The 2 files are blast results in output format 6, I sorted each file by the columns I wanted to match.

0
Entering edit mode

The 2 files are blast results in output format 6, I sorted each file by the columns I wanted to match.

0
Entering edit mode

I sorted each file by the columns I wanted to match.

show us the command line please.

0
Entering edit mode

The previous values of the columns were of test files I created to practice on them before doing it on my files. My actual is the following: I checked my files and the blast results I think all lines are correct for join, lets call this file A. That file is like this:

 WP_002438140.1  1|3440624|6|Catalog-(metahit-v2)|MC3.MG6.AS1.GP1.C47040.G4|aac6    40.506  158     93      1       2       158     3       160     4.02e-38        126
WP_002438140.1  1|1186628|6|Catalog-(metahit-v2)|MC3.MG2.AS1.GP1.C18992.G2|aac6    39.623  159     94      2       1       158     17      174     8.48e-29        102
WP_002438140.1  1|1998910|6|Catalog-(metahit-v2)|MC3.MG293.AS1.GP1.C1957.G20|aac6    38.889  162     97      2       1       161     1       161     6.13e-28        100
WP_002438140.1  1|1715055|6|Catalog-(metahit-v2)|MC3.MG262.AS1.GP1.C7747.G3|aac6    35.625  160     98      4       2       158     7       164     4.24e-26        95.9
WP_002438140.1  1|1459548|6|Catalog-(metahit-v2)|MC3.MG232.AS1.GP1.C19886.G1|aac6    34.320  169     103     2       2       163     6       173     1.92e-24        92.0
WP_002438140.1  1|1444795|6|Catalog-(metahit-v2)|MC3.MG229.AS1.GP1.C15873.G17|aac6    35.152  165     105     1       2       164     14      178     2.56e-24        91.7
WP_002438140.1  1|2947214|6|Catalog-(metahit-v2)|MC3.MG373.AS1.GP1.C76048.G3|aac6    30.380  158     109     1       1       158     1       157     1.41e-23        89.0
WP_002438140.1  1|3816016|6|Catalog-(metahit-v2)|MC3.MG96.AS1.GP1.C24897.G23|aac6    31.013  158     108     1       1       158     1       157     8.44e-23        87.0
WP_002438140.1  1|735451|6|Catalog-(metahit-v2)|MC3.MG148.AS1.GP1.C6414.G2|aac6    32.515  163     101     3       1       158     1       159     1.15e-22        87.0
WP_002438140.1  1|3779723|6|Catalog-(metahit-v2)|MC3.MG94.AS1.GP1.C579.G4|aac6    31.707  164     105     2       1       161     32      191     1.41e-22        87.0


The other file, lets call it file B, some lines do not have all the columns :

    GCF_902459825.1 chromosome      1       1188    WP_156068365.1  WP_156068365.1  1187
GCF_902459825.1 chromosome      1716    2924    WP_010950321.1  WP_010950321.1  1208
GCF_902459825.1 chromosome      2944    4101    WP_003400273.1  WP_003400273.1  1157
GCF_902459825.1 chromosome      4098    4661    WP_003899769.1  WP_003899769.1  563
GCF_902459825.1 chromosome      4904    6931    WP_031652105.1  WP_031652105.1  2027
GCF_902459825.1 chromosome      6966    9482    WP_003400286.1  WP_003400286.1  2516
GCF_902459825.1 chromosome      9578    10492   WP_003400291.1  WP_003400291.1  914
GCF_902459825.1 chromosome      10889   11080   WP_003400294.1  WP_003400294.1  191
GCF_902459825.1 chromosome      11219   11356   WP_003400297.1  WP_003400297.1  137
GCF_902459825.1 chromosome      11538   11975   WP_003400307.1  WP_003400307.1  437
GCF_902459825.1 chromosome      12132   12680   WP_003400321.1  WP_003400321.1  548
GCF_902459825.1 chromosome      12797   13221                   424
GCF_902459825.1 chromosome      13377   13658   WP_003400344.1  WP_003400344.1  281
GCF_902459825.1 chromosome      13752   14540   WP_003902799.1  WP_003902799.1  788
GCF_902459825.1 chromosome      14577   15275   WP_003400352.1  WP_003400352.1  698
GCF_902459825.1 chromosome      15253   17133   WP_003400356.1  WP_003400356.1  1880
GCF_902459825.1 chromosome      17130   18425   WP_003400358.1  WP_003400358.1  1295


I want to match the fifth column in file B to the 1st column in file A and when it matches add column 7 to file A. Should I use an AWK instead of join?

0
Entering edit mode

show us the command line please.

0
Entering edit mode
S530FN:~\$ join -t' ' -1 1 -2 5 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.7 join_2.txt join_1.txt > join_try_3.txt


It gives this error:

join: invalid file number in field spec: ‘sample1_blast.out

0
Entering edit mode

my question 'show us the command line' was about how you sorted both files. Furthermore, from what you showed above, it's hard to say if you used a space or a tab. At this point, I'm leaving this thread.

0
Entering edit mode

Well I did not understand, I thought it was the command I was having issues with .

But here is the sort commands:

sort -k 1 sample1_blast.out

sort -k 5 sample2_my_files.txt