Question

Match columns of two different files and paste a value of corrsponding match existed in another column

0

Entering edit mode

7.2 years ago

waqasnayab ▴ 250

Hi,

I have two files:

file1:
chr1    873251  0.0620
chr1    930248  0.9210
chr1    930939  0.2110
chr1    931131  0.6050

file2: 
chr1    873251  rs11240779      G       A       4047.9  6       0.75    8       171     23.81   1.071   53.55   ncRNA_intronic  0       FA
chr1    930248  rs41285790      G       A       2545.9  2       0.25    8       535     12.67   4.575   54.53   exonic  nonsynonymous_SNV
chr1    930939  rs9988021       G       A       4783.35 8       1       8       155     31.06   0       56.55   intronic        0       SA
chr1    931131  rs375757231     C       CCCCT   5151.25 6       0.75    8       168     34.13   77.892  54.05   intronic        0       SA

I want to match column 3 of file1 with column 2 of file 2 (search the entire of column 2 of file 2), and if column matches paste the value of column 4 from file 1 into file2 as new column 3. PS: column lengths are different. file 1 has 107011 values in each column, and file 2 has 107113 values in each column.

So, that my expected output would be:

chr1    873251  0.0620 rs11240779      G       A       4047.9  6       0.75    8       171     23.81   1.071   53.55   ncRNA_intronic  0       FA
    chr1    930248  0.9210 rs41285790      G       A       2545.9  2       0.25    8       535     12.67   4.575   54.53   exonic  nonsynonymous_SNV
    chr1    930939  0.2110 rs9988021       G       A       4783.35 8       1       8       155     31.06   0       56.55   intronic        0       SA
    chr1    931131  0.6050 rs375757231     C       CCCCT   5151.25 6       0.75    8       168     34.13   77.892  54.05   intronic        0       SA

I tired my luck a lot with awk but no success.

Best,

Waqas.

genome next-gen sequencing • 3.0k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 35k • written 7.2 years ago by waqasnayab ▴ 250

0

Entering edit mode

What exactly did you try with awk ?

ADD REPLY • link 7.2 years ago by NB ▴ 960

score 3 · Accepted Answer · 2017-02-08

Here's a solution with BEDOPS sort-bed and bedmap, along with a couple Unix tools:

Convert the first file to a sorted BED file with awk and sort-bed:

$ awk '{ print $1"\t"$2"\t"($2+1)"\t"$3; }' file1.txt | sort-bed - > file1.bed

Do the same with the second file:

$ awk '{ print $1"\t"$2"\t"($2+1)"\t"substr($0, index($0,$3)); }' file2.txt | sort-bed - > file2.bed

Map the second file to the first with bedmap, and pipe the results of that to cut to filter out the columns you want:

$ bedmap --delim '\t' --exact --echo --echo-map --skip-unmapped file1.bed file2.bed | cut -f1,2,8- > answer.txt

The file answer.txt should follow the specification of your example's expected result.