Question: Compare two files to retrieve common lines
Sam wrote:

Dear all,

I have two tab-delimited files:

file 1 

    Chromosome  Position    
    Chr01   45943   
    Chr01   45965
    Chr01   45981   
    Chr01   46122   
    Chr02   45965

file 2

    Chr01   6789    SNP A   T   15.17   90.91   6   18  6
    Chr01   6795    SNP G   T   12.11   81.82   6   17  4
    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

I need output like this:

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

How can I retrieve the lines from file 2 that exactly match columns 1 and 2 of file 1?

Thanks

Pierre Lindenbaum wrote:

Create a unique key CHR_POS with sed and get the common lines with join. Something like this (not tested):

join -t $'\t' -1 1 -2 1 \
   <(sed 's/\t/_/' file1.txt | sort -t $'\t' -k1,1) \
   <(sed 's/\t/_/' file2.txt | sort -t $'\t' -k1,1)  |\
sed 's/_/\t/'

In the files, some of the names are in this format:

    scaffold_1005   6522
    scaffold_1005   6565
    scaffold_1006   12174

but in the output file they change to:

    scaffold    1006_12174
    scaffold    1005_6565

This means the name and position have been merged. How can I solve this?

(Reply by Sam, 6 months ago)

Needless to say: use another delimiter, e.g. sed 's/\t/__________________________/' (and use the same string in the final sed that restores the tab).

(Reply by Pierre Lindenbaum, 6 months ago)
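The delimiter advice can be sketched as a small script. This is only an illustration under stated assumptions: the sample data and the file names under `$workdir` are hypothetical, `|` is assumed never to occur in chromosome names or positions, and temporary files are used instead of process substitution so that it runs in plain sh.

```sh
#!/bin/sh
# Sketch: join on a combined CHR|POS key (assumes '|' never occurs in the data).
LC_ALL=C; export LC_ALL          # same collation for sort and join
tab=$(printf '\t')
workdir=$(mktemp -d)

# Hypothetical sample data, including a scaffold name that contains '_'.
printf 'Chromosome\tPosition\nChr01\t45965\nscaffold_1006\t12174\n' > "$workdir/file1.txt"
printf 'Chr01\t6789\tSNP\tA\tT\nChr01\t45965\tSNP\tG\tC\nscaffold_1006\t12174\tSNP\tA\tG\n' > "$workdir/file2.txt"

# file1: skip the header and keep only the key built from columns 1 and 2.
awk -F'\t' 'NR > 1 { print $1 "|" $2 }' "$workdir/file1.txt" | sort > "$workdir/keys1.txt"

# file2: replace only the first tab with '|', then sort on the key field.
sed "s/$tab/|/" "$workdir/file2.txt" | sort -t "$tab" -k1,1 > "$workdir/keyed2.txt"

# Join on the key, then turn the key delimiter back into a tab.
result=$(join -t "$tab" "$workdir/keys1.txt" "$workdir/keyed2.txt" | sed "s/|/$tab/")
printf '%s\n' "$result"

rm -r "$workdir"
```

Because the key delimiter differs from every character in the names, the final sed restores exactly the tab that was replaced, so names such as scaffold_1006 stay intact.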
Benn wrote:

You can first use awk to add a new column to the second file, containing chromosome + position pasted together. Then use awk followed by grep in a pipe to select the rows of interest (and awk again to remove the new key column).

awk '{ $(NF+1)=$1$2 ; print }' file2 > file2a

cat file2a
Chr01 6789 SNP A T 15.17 90.91 6 18 6 Chr016789
Chr01 6795 SNP G T 12.11 81.82 6 17 4 Chr016795
Chr01 45965 SNP G C 26.33 100.00 6 21 6 Chr0145965

awk '{ $(NF+1)=$1$2 ; print($3) }' file1 | grep -f - -w file2a | awk '{$11="" ; print}'
Chr01 45965 SNP G C 26.33 100.00 6 21 6

'Chr01 4596' would also match 'Chr01 4596111111111', because grep matches substrings by default.

(Reply by Pierre Lindenbaum, 6 months ago)

Thanks Pierre, I have now included -w in the grep statement.

(Reply by Benn, 6 months ago)

cpad0112 wrote:

output with awk:

$ awk 'NR==FNR {a[$1,$2];next} ($1,$2) in a' file1.txt file2.txt

or

$ awk 'NR==FNR {a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt

Chr01   45965   SNP G   C   26.33   100.00  6   21  6
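For readers unfamiliar with the two-file awk idiom, here is a commented sketch of the same idea. The sample data and the `$workdir` paths are made up for illustration, and the header skip (`FNR > 1`) is an addition that assumes file1 has exactly one header line.

```sh
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample data (tab-separated); file1 has a header line.
printf 'Chromosome\tPosition\nChr01\t45943\nChr01\t45965\nChr02\t45965\n' > "$workdir/file1.txt"
printf 'Chr01\t6789\tSNP\tA\tT\nChr01\t45965\tSNP\tG\tC\nChr01\t459656\tSNP\tG\tC\n' > "$workdir/file2.txt"

result=$(awk '
    # While reading the first file, NR (overall line number) equals
    # FNR (line number within the current file).
    NR == FNR {
        if (FNR > 1) seen[$1, $2] = 1   # skip the header, remember the key
        next                            # never print lines of the first file
    }
    # Second file: print the line only if its (col 1, col 2) key was stored.
    ($1, $2) in seen
' "$workdir/file1.txt" "$workdir/file2.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

The keys are compared as whole strings, so 45965 does not match 459656. One caveat of the idiom: if the first file is empty, NR == FNR is also true for the second file and nothing is printed.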

output with grep (with OP text):

$ grep -wf file1.txt file2.txt 

Chr01   45965   SNP G   C   26.33   100.00  6   21  6
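Note that -w still lets a pattern match anywhere in the line. A stricter variant, sketched here with made-up data and hypothetical file names under `$workdir`, builds patterns anchored to the start of the line and ending in a tab. It assumes tab-separated input, at least three columns in file2, and no regex metacharacters (such as `.`) in the chromosome names.

```sh
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample data (tab-separated); file1 has a header line.
printf 'Chromosome\tPosition\nChr01\t45965\n' > "$workdir/file1.txt"
printf 'Chr01\t45965\tSNP\tG\tC\nChr01\t459656\tSNP\tG\tC\nChr01\t4596111111111\tSNP\tG\tC\n' > "$workdir/file2.txt"

# One pattern per file1 row: start of line, name, tab, position, tab.
awk 'NR > 1 { printf "^%s\t%s\t\n", $1, $2 }' "$workdir/file1.txt" > "$workdir/patterns.txt"

# Only lines whose first two columns equal a file1 row can match.
result=$(grep -f "$workdir/patterns.txt" "$workdir/file2.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

The trailing tab in each pattern is what rules out 459656 and 4596111111111, and the leading ^ prevents matches in later columns.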

output with tsv-utils:

$ tsv-join -f file2.txt --key-fields 1,2 file1.txt --append-fields 3-10

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

Input:

$ tail -n+1 file1.txt  file2.txt 
==> file1.txt <==
Chromosome  Position    
Chr01   45943   
Chr01   45965
Chr01   45981   
Chr01   46122   
Chr02   45965

==> file2.txt <==
Chr01   6789    SNP A   T   15.17   90.91   6   18  6
Chr01   6795    SNP G   T   12.11   81.82   6   17  4
Chr01   45965   SNP G   C   26.33   100.00  6   21  6
Chr01   459656  SNP G   C   26.33   100.00  6   21  6
Chr01   4596111111111   SNP G   C   26.33   100.00  6   21  6

Please check Pierre's comment on my answer; the same issue affects your awk and grep solutions as well (I only checked those two).

(Reply by Benn, 6 months ago)

Thanks b.nota. I have updated the example input and the grep and awk code.

(Reply by cpad0112, 6 months ago)