Compare two files to retrieve common lines
5.5 years ago
Sam ▴ 150

Dear all,

I have two tab-delimited files:

file 1 

    Chromosome  Position    
    Chr01   45943   
    Chr01   45965
    Chr01   45981   
    Chr01   46122   
    Chr02   45965

file 2

    Chr01   6789    SNP A   T   15.17   90.91   6   18  6
    Chr01   6795    SNP G   T   12.11   81.82   6   17  4
    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

I need output like this:

    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

How can I retrieve the lines from file 2 that exactly match columns 1 and 2 of file 1?

Thanks

bash awk • 1.7k views
5.5 years ago

Create a unique key CHR_POS with sed and get the common lines with join. Something like this (not tested):

join -t $'\t' -1 1 -2 1 \
   <(sed 's/\t/_/' file1.txt | sort -t $'\t' -k1,1) \
   <(sed 's/\t/_/' file2.txt | sort -t $'\t' -k1,1)  |\
sed 's/_/\t/'

In the files, some of the names are in this format:

    scaffold_1005   6522
    scaffold_1005   6565
    scaffold_1006   12174

but in the output file they change to:

    scaffold    1006_12174
    scaffold    1005_6565

which means the name has been split and its second part merged with the position. How can I solve it?


Needless to say: use another delimiter, e.g. sed 's/\t/__________________________/'
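A minimal sketch of that suggestion, assuming `|` never occurs in the sequence names. The temporary directory, the intermediate file names, and the sample rows are made up for illustration. It passes a literal tab to sed and uses temporary files instead of process substitution, so it should not depend on GNU sed or bash:

```shell
# Hypothetical sample data: names contain underscores, as in the comment above.
tmp=$(mktemp -d)
printf 'scaffold_1005\t6565\nscaffold_1006\t12174\n' > "$tmp/file1.txt"
printf 'scaffold_1005\t6522\tSNP\nscaffold_1005\t6565\tSNP\nscaffold_1006\t12174\tSNP\n' > "$tmp/file2.txt"

tab=$(printf '\t')   # literal tab character

# Turn the first tab into '|' so that name|position becomes a single join key.
# cut keeps only the key for file1 (drops any trailing empty columns).
sed "s/$tab/|/" "$tmp/file1.txt" | cut -f1 | LC_ALL=C sort -u > "$tmp/keys1.txt"
sed "s/$tab/|/" "$tmp/file2.txt" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/keyed2.txt"

# Join on the key, then turn the '|' back into a tab.
result=$(LC_ALL=C join -t "$tab" "$tmp/keys1.txt" "$tmp/keyed2.txt" | sed "s/|/$tab/")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the final sed only touches the first `|`, underscores inside the names are left alone.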

5.5 years ago
Benn 8.3k

You can first use awk to add a new column to the second file, containing chromosome + position pasted together. Then use awk followed by grep in a pipe to select the rows of interest (and awk again to remove the new key column).

awk '{ $(NF+1)=$1$2 ; print }' file2 > file2a

cat file2a
Chr01 6789 SNP A T 15.17 90.91 6 18 6 Chr016789
Chr01 6795 SNP G T 12.11 81.82 6 17 4 Chr016795
Chr01 45965 SNP G C 26.33 100.00 6 21 6 Chr0145965

awk '{ $(NF+1)=$1$2 ; print($3) }' file1 | grep -f - -w file2a | awk '{$11="" ; print}'
Chr01 45965 SNP G C 26.33 100.00 6 21 6
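One caveat with pasting the two columns together without a separator: different chromosome/position pairs can produce the same key. A small sketch with made-up rows (`Chr1`/`12` and `Chr11`/`2` are hypothetical, not from the files above) shows the collision, and that a separator such as `:` (assumed absent from the names) avoids it:

```shell
# Two distinct (chromosome, position) pairs, invented for illustration.
rows=$(printf 'Chr1\t12\nChr11\t2\n')

# Keys built as $1$2: both rows become "Chr112", so only one distinct key remains.
plain=$(printf '%s\n' "$rows" | awk '{ print $1 $2 }' | sort -u | awk 'END { print NR }')

# Keys built as $1 ":" $2: "Chr1:12" and "Chr11:2" stay distinct.
separated=$(printf '%s\n' "$rows" | awk '{ print $1 ":" $2 }' | sort -u | awk 'END { print NR }')

echo "distinct keys without separator: $plain, with separator: $separated"
```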

'Chr01 4596' would match Chr01 4596111111111


Thanks Pierre, I have now included -w in the grep statement.
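To make the prefix-matching problem concrete, here is a small sketch with invented file names and rows. It counts matching lines with and without `-w`; `-F` is added so the pattern is treated as a fixed string rather than a regular expression:

```shell
tmp=$(mktemp -d)
# One pattern and three candidate rows (hypothetical sample data).
printf 'Chr01\t45965\n' > "$tmp/patterns.txt"
printf 'Chr01\t45965\tSNP\nChr01\t459656\tSNP\nChr01\t4596111111111\tSNP\n' > "$tmp/file2.txt"

# Without -w the pattern also matches as a prefix of position 459656.
loose=$(grep -c -F -f "$tmp/patterns.txt" "$tmp/file2.txt")

# With -w the match must be followed by a non-word character (here, the tab).
strict=$(grep -c -F -w -f "$tmp/patterns.txt" "$tmp/file2.txt")

echo "without -w: $loose, with -w: $strict"
rm -rf "$tmp"
```

Note that `-w` still lets the pattern match anywhere on the line, not only in columns 1 and 2, so a key-based comparison is stricter when other columns could contain similar text.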

5.5 years ago

output with awk:

$ awk 'NR==FNR {a[$1,$2];next} ($1,$2) in a' file1.txt file2.txt

or

$  awk 'NR==FNR {a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

Chr01   45965   SNP G   C   26.33   100.00  6   21  6
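For readers unfamiliar with the `NR==FNR` idiom, here is a commented sketch of the first awk command on made-up sample files (the array name `seen` and the temporary directory are illustrative choices). While awk reads the first file, `NR` equals `FNR`, so the pairs are stored; for the second file only lines whose first two fields form a stored pair are printed:

```shell
tmp=$(mktemp -d)
# Hypothetical inputs modelled on the question (tab-delimited).
printf 'Chromosome\tPosition\nChr01\t45943\nChr01\t45965\nChr02\t45965\n' > "$tmp/file1.txt"
printf 'Chr01\t6789\tSNP\nChr01\t45965\tSNP\nChr01\t459656\tSNP\n' > "$tmp/file2.txt"

matches=$(awk '
    # First file: remember every (chromosome, position) pair, print nothing.
    NR == FNR { seen[$1, $2]; next }
    # Second file: print the line only if its pair was stored above.
    ($1, $2) in seen
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Since the comparison is on whole fields, position 459656 is not confused with 45965.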

output with grep (with OP text):

$ grep -wf file1.txt file2.txt 

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

output with tsv-utils:

$ tsv-join -f file2.txt --key-fields 1,2 file1.txt --append-fields 3-10

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

Input:

$ tail -n+1 file1.txt  file2.txt 
==> file1.txt <==
Chromosome  Position    
Chr01   45943   
Chr01   45965
Chr01   45981   
Chr01   46122   
Chr02   45965

==> file2.txt <==
Chr01   6789    SNP A   T   15.17   90.91   6   18  6
Chr01   6795    SNP G   T   12.11   81.82   6   17  4
Chr01   45965   SNP G   C   26.33   100.00  6   21  6
Chr01   459656  SNP G   C   26.33   100.00  6   21  6
Chr01   4596111111111   SNP G   C   26.33   100.00  6   21  6

Please check Pierre's comment to my answer, it messes with your awk and grep solution as well (only checked those two).


Thanks b.nota. Updated example input and grep and awk codes.
