How to write a script to grep information from another file without losing location information (in the right order)
3.7 years ago
Alexie Li ▴ 20

Hi all

I am too new to programming to solve this problem. I have two files: file1 contains the index, and file2 contains the information I want.

 1. file1

CU_91

CU_495

CW_79

CU_22

CW_42

2. file2

CW_79   protein1

CW_15  protein2

CW_16  protein3

CW_17   protein4

CW_42   protein5


I want to add the extra information from file2 to file1 without changing the order of file1, as follows. How could I do that?

CU_91

CU_495

CW_79  protein1

CU_22

CW_42 protein5


Thank you!

Alexie

linux script • 1.0k views
3.7 years ago
russhh 5.5k

What you've described is a left outer join of the data in file1 with the data in file2. Have a look at the join command (example). If your file2 is tab-separated, I think you can do the following:

join -t $'\t' file1 file2 -a1

For example,

echo -e "A\nB\nC" > f1
cat f1
A
B
C
echo -e "A\tP1\nC\tP2\nD\tP3" > f2
cat f2
A    P1
C    P2
D    P3
join -t$'\t' f1 f2 -a1

A    P1
B
C    P2


The syntax is a bit awkward for specifying the separator IMO


Hi russhh!

I tried this method but failed to get the result. I think there are two problems: 1) I can't sort file1, since I need the order information; 2) for some reason, my system does not recognize "join -t $'\t'" and gives the error message "join: illegal tab character specification". I changed file2 with the command sed 's/ /\t/g'

I believe 'join' requires the input to be sorted, but Alexie wants to maintain the order. I don't know of a good way to do it that doesn't require writing a program and keeping stuff in memory (or something similar).

3.7 years ago

Assuming the tab is the delimiter. The first awk is used to keep the line number of the first file.

join -t$'\t' -a 1 -1 2 -2 1 \
<(awk '{printf("%d\t%s\n",NR,$1);}' file.1 | sort -t$'\t' -k2,2) \
<(sort -t $'\t' -k1,1 file.2) |
sort -t$'\t' -k2,2n | cut -f 1,3

CU_91
CU_495
CW_79   protein1
CU_22
CW_42   protein5


It worked! Thank you.
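
For comparison, the "keep stuff in memory" approach mentioned in the comments can be sketched with a single awk call, which needs no sorting at all. This is only a minimal sketch, assuming file2 is tab-separated, has the ID in its first column, and is small enough to hold in memory. The function name annotate and the temporary demo files are hypothetical and not from the thread.

```shell
# Annotate each ID in an index file with the matching value from a
# tab-separated lookup file, keeping the index file's original order.
# "annotate" is a hypothetical helper name used only for this sketch.
annotate() {
    # $1 = index file (one ID per line), $2 = lookup file (ID<TAB>value)
    awk -F'\t' '
        # While reading the first file given (the lookup file):
        # store column 2 keyed by column 1, then skip to the next line.
        FNR == NR { info[$1] = $2; next }
        # While reading the second file given (the index file):
        # print the ID with its stored value when there is one...
        ($1 in info) { print $1 "\t" info[$1]; next }
        # ...otherwise print the ID alone.
        { print $1 }
    ' "$2" "$1"
}

# Demo data modelled on the question (assumed to be tab-separated).
workdir=$(mktemp -d)
printf 'CU_91\nCU_495\nCW_79\nCU_22\nCW_42\n' > "$workdir/file1"
printf 'CW_79\tprotein1\nCW_15\tprotein2\nCW_16\tprotein3\nCW_17\tprotein4\nCW_42\tprotein5\n' > "$workdir/file2"

annotate "$workdir/file1" "$workdir/file2"
```

Because the lookup file is read first and held in an awk array, the index file is simply streamed in its existing order. The FNR == NR test relies on the lookup file being non-empty.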