How do I match two files using one column as reference?
2
0
Entering edit mode
3.9 years ago
eennadi ▴ 30

I have two files with with two columns the. File 1 CBS11016_00000185-RA h. CBS11016_00000186-RA h. CBS11016_00000187-RA h. CBS11016_00000188-RA h. CBS11016_00000189-RA h. CBS11016_00000190-RA h. CBS11016_00000191-RA h. CBS11016_00000192-RA h. CBS11016_00000193-RA h. CBS11016_00000194-RA h. CBS11016_00000195-RA h. CBS11016_00000196-RA h. CBS11016_00000197-RA h. CBS11016_00000198-RA h. CBS11016_00000199-RA h. CBS11016_00000200-RA h. CBS11016_00000201-RA h. CBS11016_00000202-RA h. CBS11016_00000203-RA h. CBS11016_00000204-RA h. CBS11016_00000205-RA h. CBS11016_00000206-RA h. CBS11016_00000207-RA h. CBS11016_00000208-RA h. CBS11016_00000209-RA h. CBS11016_00000210-RA h. CBS11016_00000211-RA h. CBS11016_00000212-RA h. CBS11016_00000213-RA h. CBS11016_00000214-RA h. CBS11016_00000215-RA h. CBS11016_00000216-RA h. CBS11016_00000217-RA h. CBS11016_00000218-RA h. CBS11016_00000219-RA h. CBS11016_00000220-RA h. CBS11016_00000221-RA h. CBS11016_00000222-RA h. CBS11016_00000223-RA h. CBS11016_00000224-RA h.

File 2

CBS11016_00000185-RA Dip5p CBS11016_00000194-RA hypothetical protein CBS11016_00000189-RA hypothetical protein CBS11016_00000215-RA Threonine--tRNA ligase CBS11016_00000199-RA NAD-dependent protein deacylase CBS11016_00000190-RA Cellulase domain-containing protein CBS11016_00000223-RA 40S ribosomal protein S16 CBS11016_00000224-RA hypothetical protein CBS11016_00000225-RA hypothetical protein CBS11016_00000226-RA hypothetical protein CBS11016_00000230-RA 40S ribosomal protein S1 CBS11016_00000231-RA hypothetical protein CBS11016_00000232-RA 60S ribosomal protein L6 CBS11016_00000233-RA FAS1 domain-containing protein CaO19.3004 CBS11016_00000235-RA hypothetical protein CBS11016_00000239-RA Octanoyltransferase CBS11016_00000245-RA hypothetical protein CBS11016_00000247-RA SsDNA-dependent ATPase CBS11016_00000248-RA MRP-S28 domain-containing protein CBS11016_00000249-RA hypothetical protein

I want a file that can use the column one of file 1 as a reference and match it with the equivalence in column 2 of file 2 I had used awk 'NR==FNR{a[$1]=$2;next} {print $0,a[$1]}' file1 flie2 > newfile

But it rather would combine the column 2 and shorten the length of the strings in file 2. Can any help with a better expression?

awk command line • 1.2k views
ADD COMMENT
0
Entering edit mode

Please format this example data using code button. In present format it is illegible. You can only provide a couple of rows.


code_formatting

Thank you!

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Check merge function in R.

ADD REPLY
0
Entering edit mode

try sqldf in R.

ADD REPLY
0
Entering edit mode

Take a look at the commandline tools sort and join.

ADD REPLY
0
Entering edit mode
3.9 years ago
gdaly9000 ▴ 10

I think Python Pandas offers a great interface to do these types of merges. The R Tidyverse has similar functionality (I believe dplyr). You can actually perform some relatively complicated merges with Pandas.

ADD COMMENT
0
Entering edit mode
3.9 years ago
Mensur Dlakic ★ 27k

Don't think I have ever seen so many people follow the thread where the question is unreadable. Or so many posters answer the unreadable question, for that matter. I will join in the fun.

This is a python solution, and it assumes that you have pandas installed. Let's say your two files are called file1.csv and file2.csv and that you want to merge them using the shared ID column.

import pandas as pd
f1 = pd.read_csv('file1.csv')
f2 = pd.read_csv('file2.csv')
f3 = f1.merge(f2, left_on='ID', right_on='ID', how='outer')
f3.to_csv('file3.csv', index=False)
ADD COMMENT

Login before adding your answer.

Traffic: 2542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6