Question

Compare two files to retrieve common lines

1

5.5 years ago

Sam ▴ 150

Dear all,

I have two tab-delimited files:

File 1:

    Chromosome  Position
    Chr01   45943
    Chr01   45965
    Chr01   45981
    Chr01   46122
    Chr02   45965

File 2:

    Chr01   6789    SNP A   T   15.17   90.91   6   18  6
    Chr01   6795    SNP G   T   12.11   81.82   6   17  4
    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

I need output like this:

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

How can I retrieve the lines from file 2 whose columns 1 and 2 exactly match columns 1 and 2 of file 1?

Thanks

bash awk • 1.7k views

ADD COMMENT • link 5.5 years ago by Sam ▴ 150

score 4 · Answer 1 · 2018-10-25

4

5.5 years ago

Pierre Lindenbaum 161k

Create a unique key CHR_POS with sed and get the common lines with join. Something like this (not tested):

join -t $'\t' -1 1 -2 1 \
   <(sed 's/\t/_/' file1.txt | sort -t $'\t' -k1,1) \
   <(sed 's/\t/_/' file2.txt | sort -t $'\t' -k1,1)  |\
sed 's/_/\t/'

ADD COMMENT • link 5.5 years ago by Pierre Lindenbaum 161k

0

In the files, some of the names are in this format:

    scaffold_1005   6522
    scaffold_1005   6565
    scaffold_1006   12174

but in the output file they change to:

    scaffold    1006_12174
    scaffold    1005_6565

which means the name and position have been merged. How can I solve this?

ADD REPLY • link 5.5 years ago by Sam ▴ 150

0

Needless to say: use another delimiter, e.g. sed 's/\t/__________________________/' (and use the same delimiter in the final sed that restores the tab).

ADD REPLY • link 5.5 years ago by Pierre Lindenbaum 161k
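
The delimiter change suggested above can be sketched as follows. This is only an illustration under stated assumptions: the demo rows are made up, and `|` is picked as the key separator on the assumption that it never occurs in the data. It uses awk rather than sed for the substitution so that `\t` is handled the same way across implementations, and temporary files instead of process substitution.

```shell
# Sketch only: the demo rows and the '|' key separator are assumptions.
tmp=$(mktemp -d)
tab=$(printf '\t')

printf 'scaffold_1005\t6522\nscaffold_1005\t6565\nscaffold_1006\t12174\n' > "$tmp/file1.txt"
printf 'scaffold_1005\t6565\tSNP\tG\tC\nscaffold_1007\t99\tSNP\tA\tT\n' > "$tmp/file2.txt"

# Turn the first tab into '|' so columns 1 and 2 form one join key, then sort on that key.
for f in file1 file2; do
    awk '{ sub(/\t/, "|"); print }' "$tmp/$f.txt" |
        LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/$f.keyed"
done

# Join on the key, then turn the first '|' back into a tab.
result=$(LC_ALL=C join -t "$tab" "$tmp/file1.keyed" "$tmp/file2.keyed" |
    awk '{ sub(/[|]/, "\t"); print }')

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the underscore in `scaffold_1005` is never touched, the name and the position come back as separate columns.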

score 2 · Answer 2 · 2018-10-25

2

5.5 years ago

Benn 8.3k

You can first use awk to add a new column to the second file that concatenates chromosome and position. Then use awk and grep in a pipe to select the rows of interest, and awk again to remove the added key column.

awk '{ $(NF+1)=$1$2 ; print }' file2 > file2a

cat file2a
Chr01 6789 SNP A T 15.17 90.91 6 18 6 Chr016789
Chr01 6795 SNP G T 12.11 81.82 6 17 4 Chr016795
Chr01 45965 SNP G C 26.33 100.00 6 21 6 Chr0145965

awk '{ $(NF+1)=$1$2 ; print($3) }' file1 | grep -f - -w file2a | awk '{$11="" ; print}'
Chr01 45965 SNP G C 26.33 100.00 6 21 6

ADD COMMENT • link 5.5 years ago by Benn 8.3k

1

'Chr01 4596' would match Chr01 4596111111111

ADD REPLY • link 5.5 years ago by Pierre Lindenbaum 161k

1

Thanks Pierre, I have now included -w in the grep statement.

ADD REPLY • link 5.5 years ago by Benn 8.3k
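
The effect of `-w` discussed in these replies can be checked with a small sketch. The two rows and the key below are made-up demo data in the shape of `file2a` (an assumption, not the poster's real file); the pattern is read from a file rather than from standard input.

```shell
# Sketch only: the two demo rows and the key are assumptions.
tmp=$(mktemp -d)
printf 'Chr01 4596 SNP Chr014596\nChr01 4596111111111 SNP Chr014596111111111\n' > "$tmp/file2a"
printf 'Chr014596\n' > "$tmp/keys"

# Plain matching: the key is also found inside the longer key on the second row.
without_w=$(grep -c -f "$tmp/keys" "$tmp/file2a")

# -w only accepts matches that are not preceded or followed by a word character.
with_w=$(grep -c -w -f "$tmp/keys" "$tmp/file2a")

printf 'without -w: %s, with -w: %s\n' "$without_w" "$with_w"
rm -rf "$tmp"
```

Note that a key built by plain concatenation can still be ambiguous: `Chr1` + `12345` and `Chr11` + `2345` both give `Chr112345`, which `-w` cannot tell apart.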

score 0 · Answer 3 · 2018-10-25

output with awk:

$ awk 'NR==FNR {a[$1,$2];next} ($1,$2) in a' file1.txt file2.txt

or

$ awk 'NR==FNR {a[$1,$2]++; next} a[$1,$2]' file1.txt file2.txt

Chr01   45965   SNP G   C   26.33   100.00  6   21  6
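
A commented sketch of the first awk one-liner may make the `NR==FNR` idiom easier to follow. The demo files are rebuilt from part of the thread's input, with tab-separated columns as an assumption; the array name `seen` is only illustrative.

```shell
# Sketch only: demo files rebuilt from the thread's input; tab-separated columns are an assumption.
tmp=$(mktemp -d)
printf 'Chromosome\tPosition\nChr01\t45943\nChr01\t45965\nChr01\t45981\nChr01\t46122\nChr02\t45965\n' > "$tmp/file1.txt"
{
    printf 'Chr01\t6789\tSNP\tA\tT\t15.17\t90.91\t6\t18\t6\n'
    printf 'Chr01\t45965\tSNP\tG\tC\t26.33\t100.00\t6\t21\t6\n'
    printf 'Chr01\t459656\tSNP\tG\tC\t26.33\t100.00\t6\t21\t6\n'
} > "$tmp/file2.txt"

result=$(awk '
    NR == FNR { seen[$1, $2]; next }   # first file: record each (chromosome, position) pair
    ($1, $2) in seen                   # second file: print lines whose pair was recorded
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

`NR == FNR` is true only while awk reads the first file, so the first rule fills the lookup table and `next` skips the second rule. Since whole fields are compared, `459656` does not match `45965`, and the header line of file 1 is harmless because no row in file 2 carries that pair.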

output with grep (with OP text):

$ grep -wf file1.txt file2.txt 

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

output with tsv-utils:

$ tsv-join -f file2.txt --key-fields 1,2 file1.txt --append-fields 3-10

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

Input:

$ tail -n+1 file1.txt  file2.txt 
==> file1.txt <==
Chromosome  Position    
Chr01   45943   
Chr01   45965
Chr01   45981   
Chr01   46122   
Chr02   45965

==> file2.txt <==
Chr01   6789    SNP A   T   15.17   90.91   6   18  6
Chr01   6795    SNP G   T   12.11   81.82   6   17  4
Chr01   45965   SNP G   C   26.33   100.00  6   21  6
Chr01   459656  SNP G   C   26.33   100.00  6   21  6
Chr01   4596111111111   SNP G   C   26.33   100.00  6   21  6