Question: compare two files to retrieve common lines
Sam wrote (17 months ago):

Dear all,

I have two tab-delimited files:

File 1:

    Chromosome  Position    
    Chr01   45943   
    Chr01   45965
    Chr01   45981   
    Chr01   46122   
    Chr02   45965

File 2:

    Chr01   6789    SNP A   T   15.17   90.91   6   18  6
    Chr01   6795    SNP G   T   12.11   81.82   6   17  4
    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

I need output like this:

    Chr01   45965   SNP G   C   26.33   100.00  6   21  6

How can I retrieve the lines from file 2 whose columns 1 and 2 exactly match columns 1 and 2 of file 1?

Thanks

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (17 months ago):

Create a unique key CHR_POS with sed and get the common lines with join. Something like this (not tested):

join -t $'\t' -1 1 -2 1 \
   <(sed 's/\t/_/' file1.txt | sort -t $'\t' -k1,1) \
   <(sed 's/\t/_/' file2.txt | sort -t $'\t' -k1,1)  |\
sed 's/_/\t/'

In the files, some of the names are in this format:

    scaffold_1005   6522
    scaffold_1005   6565
    scaffold_1006   12174

but in the output file they change to:

    scaffold    1006_12174
    scaffold    1005_6565

which means the name and position have been merged. How can I solve it?

Reply written 17 months ago by Sam

Needless to say: use another delimiter, e.g. sed 's/\t/__________________________/' (and use the same delimiter in the final sed that turns it back into a tab).

Reply written 17 months ago by Pierre Lindenbaum
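
To make the delimiter advice concrete, here is a minimal sketch of the join approach with a key separator that cannot collide with underscores in names such as scaffold_1005. The separator "|", the temporary directory, and the small sample files are assumptions made for illustration; it also assumes the inputs have no header line and no trailing tabs. It avoids process substitution and $'\t' so that it runs under plain sh.

```shell
#!/bin/sh
# Sketch only: "|" is an assumed separator that must not occur in either file.
tab=$(printf '\t')
sep='|'

# Hypothetical sample data written to a scratch directory.
workdir=$(mktemp -d)
printf 'scaffold_1005\t6522\nscaffold_1005\t6565\nscaffold_1006\t12174\n' > "$workdir/file1.txt"
printf 'scaffold_1005\t6565\tSNP\tG\tC\nscaffold_1007\t99\tSNP\tA\tT\n' > "$workdir/file2.txt"

# Merge columns 1 and 2 into one key (only the first tab is replaced), then sort on the key.
sed "s/$tab/$sep/" "$workdir/file1.txt" | LC_ALL=C sort -t "$tab" -k1,1 > "$workdir/keys1.txt"
sed "s/$tab/$sep/" "$workdir/file2.txt" | LC_ALL=C sort -t "$tab" -k1,1 > "$workdir/keys2.txt"

# Join on the merged key, then split the key back into two tab-separated columns.
result=$(LC_ALL=C join -t "$tab" -1 1 -2 1 "$workdir/keys1.txt" "$workdir/keys2.txt" | sed "s/$sep/$tab/")
printf '%s\n' "$result"

rm -r "$workdir"
```

Because the separator is chosen so that it never appears in the data, the final sed can only hit the position where the key was glued together, so scaffold_1005 keeps its underscore.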
Benn (Netherlands) wrote (17 months ago):

You can first use awk to add a new column to the second file that concatenates chromosome and position. Then build the same keys from the first file with awk, pipe them into grep to select the rows of interest, and use awk once more to remove the added key column.

awk '{ $(NF+1)=$1$2 ; print }' file2 > file2a

cat file2a
Chr01 6789 SNP A T 15.17 90.91 6 18 6 Chr016789
Chr01 6795 SNP G T 12.11 81.82 6 17 4 Chr016795
Chr01 45965 SNP G C 26.33 100.00 6 21 6 Chr0145965

awk '{ $(NF+1)=$1$2 ; print($3) }' file1 | grep -f - -w file2a | awk '{$11="" ; print}'
Chr01 45965 SNP G C 26.33 100.00 6 21 6

'Chr01 4596' would match Chr01 4596111111111

Reply written 17 months ago by Pierre Lindenbaum

Thanks Pierre, I have now included -w in the grep statement.

Reply written 17 months ago by Benn
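
The prefix problem Pierre points out, and the effect of -w, can be reproduced with a small test. This is only a sketch: the two sample files are made up for illustration, and -F is added (an assumption, not part of the answers above) so that the keys are treated as fixed strings rather than regular expressions.

```shell
#!/bin/sh
# Hypothetical two-line test case for substring versus whole-word matching.
workdir=$(mktemp -d)
printf 'Chr01\t45965\n' > "$workdir/file1.txt"
printf 'Chr01\t45965\tSNP\nChr01\t459656\tSNP\n' > "$workdir/file2.txt"

# Substring matching: the key "Chr01<TAB>45965" is also a prefix of the 459656 line.
loose=$(grep -c -F -f "$workdir/file1.txt" "$workdir/file2.txt")

# Whole-word matching: the match may not be followed by another digit, so 459656 is rejected.
strict=$(grep -c -w -F -f "$workdir/file1.txt" "$workdir/file2.txt")

echo "matching lines without -w: $loose, with -w: $strict"

rm -r "$workdir"
```

Note that -w still searches the whole line, so a key could in principle match in a later column; comparing the two key columns directly, as in the awk answers, avoids that.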
cpad0112 (India) wrote (17 months ago):

output with awk:

$ awk 'NR==FNR {a[$1,$2];next} ($1,$2) in a' file1.txt file2.txt

or

$ awk 'NR==FNR {a[$1,$2]++; next} a[$1,$2]' file1.txt file2.txt

Chr01   45965   SNP G   C   26.33   100.00  6   21  6
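
For readers unfamiliar with the NR==FNR idiom, the first awk one-liner can be spelled out as below. This is a commented sketch, not a different method: the array name seen, the explicit tab field separator, and the sample data are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical sample data, including a header line and trailing tabs in file1.
workdir=$(mktemp -d)
printf 'Chromosome\tPosition\t\nChr01\t45943\t\nChr01\t45965\nChr02\t45965\n' > "$workdir/file1.txt"
printf 'Chr01\t6789\tSNP\tA\tT\nChr01\t45965\tSNP\tG\tC\nChr01\t459656\tSNP\tG\tC\n' > "$workdir/file2.txt"

result=$(awk -F '\t' '
    # NR equals FNR only while the first file is being read:
    # store each chromosome and position pair as an array key.
    NR == FNR { seen[$1, $2]; next }
    # Second file: print the line only when its pair was stored earlier.
    ($1, $2) in seen
' "$workdir/file1.txt" "$workdir/file2.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

Since the lookup compares whole fields, position 459656 is not confused with 45965, and the header line of file1 simply never matches anything in file2.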

output with grep (with OP text):

$ grep -wf file1.txt file2.txt 

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

output with tsv-utils:

$ tsv-join -f file2.txt --key-fields 1,2 file1.txt --append-fields 3-10

Chr01   45965   SNP G   C   26.33   100.00  6   21  6

Input:

$ tail -n+1 file1.txt  file2.txt 
==> file1.txt <==
Chromosome  Position    
Chr01   45943   
Chr01   45965
Chr01   45981   
Chr01   46122   
Chr02   45965

==> file2.txt <==
Chr01   6789    SNP A   T   15.17   90.91   6   18  6
Chr01   6795    SNP G   T   12.11   81.82   6   17  4
Chr01   45965   SNP G   C   26.33   100.00  6   21  6
Chr01   459656  SNP G   C   26.33   100.00  6   21  6
Chr01   4596111111111   SNP G   C   26.33   100.00  6   21  6

Please check Pierre's comment on my answer; it also breaks your awk and grep solutions (I only checked those two).

Reply written 17 months ago by Benn

Thanks b.nota. I have updated the example input and the grep and awk commands.

Reply written 17 months ago by cpad0112