Question

Merging two linux files based on same rows

0

Entering edit mode

3.0 years ago

salman_96 ▴ 70

Hi, I have two files, one file (file1.txt) has snp coordinates and rs IDs. This file is obtained from Hg19 file.

1:10150:C:T     rs371194064

1:10165:A:AC    rs796884232

1:10177:A:AC    rs367896724

1:10177:A:C     rs201752861

1:10180:T:C     rs201694901

1:10199:A:T     rs905327004

1:10228:TA:T    rs143255646

The other file (file2.txt) that I have has coordinates only and looks like this

1:838916:A:T

1:839461:T:C

1:839495:G:T

1:839528:A:G

1:839529:T:G

1:840353:G:C

I want to merge both files based on identical coordinates between them. The first file is a large file where as the second file has a few million snps. i want to find the snp rsIDs for the second file by merging it with the first

I have used this command but it generates a file that has everything in it including the content of file 2 but I only want the matching rows between them.

paste -d " "  file1.txt file2.txt > Final_rsIDs.txt

rsID coordinates plinkfile Merge • 1.1k views

ADD COMMENT • link updated 3.0 years ago by GenoMax 141k • written 3.0 years ago by salman_96 ▴ 70

0

Entering edit mode

$ join -1 1 -2 1 file1.txt file2.txt

join needs file to be sorted. If program throws an error on sorting, sort both the files and join.

ADD REPLY • link 3.0 years ago by cpad0112 21k

score 0 · Answer 1 · 2021-05-04

0

Entering edit mode

3.0 years ago

Zhilong Jia ★ 2.2k

awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

ADD COMMENT • link 3.0 years ago by Zhilong Jia ★ 2.2k

0

Entering edit mode

Common column is first column, not 2nd. Correct syntax would be awk 'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt. Since both the files do not share any common values, it would be blank.

$ awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

1:838916:A:T 
1:839461:T:C 
1:839495:G:T 
1:839528:A:G 
1:839529:T:G 
1:840353:G:C 

$ awk  'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt 

No output

If I add 1:10150:C:T from column 1 of file 1, to file2, output is as follows:

$ awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

1:838916:A:T 
1:839461:T:C 
1:839495:G:T 
1:839528:A:G 
1:839529:T:G 
1:840353:G:C 
1:10150:C:T rs371194064

$ awk  'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt           
1:10150:C:T

However user wants to print matching lines from file 1, it would be:

$ awk 'NR==FNR {a[$1]; next} $1 in a {print}' file2.txt file1.txt

ADD REPLY • link 3.0 years ago by cpad0112 21k

score 0 · Answer 2 · 2021-05-04

Just to check that I understand, you are trying to extract the rows from file1 that match the coordinates in file2 (rather than trying to paste the file2 rows together with the file1 rows)?

It sounds like you could do this with grep. You can give file2 to grep and then it will look through file1 to find any rows that match what is in file2:

grep -f file2.txt file1.txt > Final_rsIDs.txt

score 0 · Answer 3 · 2021-05-04

0

Entering edit mode

3.0 years ago

Carambakaracho ★ 3.2k

You can use grep with patterns from another file input

grep -f file2.txt file1.txt

You might want to play around with the patterns like adding a ^ in front to make it more explicit that a row must start with this pattern

sed 's/^/\^/' <file2.txt >file2.mod.txt

ADD COMMENT • link 3.0 years ago by Carambakaracho ★ 3.2k