How to write a script to pull information from another file without losing the line order of the first file
3.7 years ago
Alexie Li ▴ 20

Hi all

I am too new to programming to solve this problem. I have two files: file1 contains the index, and file2 contains the information I want.

 1. file1

CU_91

CU_495

CW_79

CU_22

CW_42

 2. file2

CW_79   protein1

CW_15  protein2

CW_16  protein3

CW_17   protein4

CW_42   protein5

I want to add the extra information from file2 to file1 without changing the order of file1, as follows. How could I do that?

CU_91

CU_495

CW_79  protein1

CU_22

CW_42 protein5

Thank you!

Alexie

linux script • 1.0k views
3.7 years ago
russhh 5.5k

What you've described is a left outer join of the data in file1 with the data in file2. Have a look at the join command. If your file2 is tab-separated, I think you can do the following:

join -t $'\t' file1 file2 -a1

For example,

echo -e "A\nB\nC" > f1
cat f1
A
B
C

echo -e "A\tP1\nC\tP2\nD\tP3" > f2
cat f2
A    P1
C    P2
D    P3



join -t $'\t' f1 f2 -a1 

A    P1
B
C    P2

The syntax for specifying the separator is a bit awkward, IMO.

Hi russhh!

Thank you for your help.

I tried this method but failed to get the result. I think there are two problems: 1) I can't sort file1, since I need the order information. 2) For some reason, my system does not recognize "join -t $'\t'" and gives the error message "join: illegal tab character specification". I converted the spaces in file2 to tabs with the command sed 's/ /\t/g'.

I believe 'join' requires the input to be sorted, but Alexie wants to maintain the order.

I don't know of a good way to do it that doesn't require writing a program and keeping stuff in memory (or something similar).
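
For what it's worth, the in-memory approach can be fairly short in awk. The following is only a minimal sketch, assuming file2 is tab-separated, small enough to hold in memory, and non-empty. The array name info and the output file merged.txt are made up for illustration.

```shell
# Recreate the example files from the question (tab-separated file2 is an assumption).
printf 'CU_91\nCU_495\nCW_79\nCU_22\nCW_42\n' > file1
printf 'CW_79\tprotein1\nCW_15\tprotein2\nCW_16\tprotein3\nCW_17\tprotein4\nCW_42\tprotein5\n' > file2

# First pass (file2, where NR == FNR): remember column 2 under the ID in column 1.
# Second pass (file1): print each ID in its original order, appending the
# remembered value when the ID was seen in file2.
awk -F'\t' 'NR == FNR    { info[$1] = $2; next }
            ($1 in info) { print $1 "\t" info[$1]; next }
                         { print $1 }' file2 file1 > merged.txt

cat merged.txt
```

Because file1 is only read line by line and never sorted, its order is preserved without any extra bookkeeping.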

3.7 years ago

Assuming tab is the delimiter. The first awk prefixes each line of the first file with its line number, so the original order can be restored by a numeric sort after the join.

 join -t $'\t' -a 1 -1 2 -2 1  \
         <(awk '{printf("%d\t%s\n",NR,$1);}' file.1  | sort -t $'\t' -k2,2) \
         <(sort -t $'\t' -k1,1 file.2) |\
      sort -t $'\t' -k2,2n | cut -f 1,3

CU_91
CU_495
CW_79   protein1
CU_22
CW_42   protein5
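
The same pipeline can be broken into separate steps, which may make it easier to follow. This is a sketch under a few assumptions: the intermediate file names are invented, temporary files replace the process substitutions, and the tab is stored in a variable via printf in case the shell does not expand $'\t' (one possible cause of the "illegal tab character specification" error mentioned above, though that is a guess).

```shell
# Use one collation order for both sort and join so they agree on what "sorted" means.
export LC_ALL=C
tab=$(printf '\t')   # portable stand-in for $'\t'

# Example input, recreated from the question.
printf 'CU_91\nCU_495\nCW_79\nCU_22\nCW_42\n' > file.1
printf 'CW_79\tprotein1\nCW_15\tprotein2\nCW_16\tprotein3\nCW_17\tprotein4\nCW_42\tprotein5\n' > file.2

# Step 1: prefix each ID with its line number, then sort by ID (column 2).
awk '{ printf("%d\t%s\n", NR, $1) }' file.1 | sort -t "$tab" -k2,2 > file1.numbered

# Step 2: sort the annotation file by ID (column 1).
sort -t "$tab" -k1,1 file.2 > file2.sorted

# Step 3: left outer join on the ID; -a 1 keeps IDs that have no match in file.2.
# Output columns: ID, line number, annotation (if any).
join -t "$tab" -a 1 -1 2 -2 1 file1.numbered file2.sorted > joined.txt

# Step 4: restore the original order by line number, then drop that column.
sort -t "$tab" -k2,2n joined.txt | cut -f 1,3 > result.txt

cat result.txt
```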

It worked! Thank you.
