Question: Merge two files on multiple matching columns
Asked 5 weeks ago by iya837:

Hello,

I'm trying to join two files on multiple matching columns. I want to compare columns 1,2,4,5 of file 1 with columns 1,2,4,5 of file 2 and, for the matching lines, write to file 3 column 3 of file 1 together with all columns of file 2. I've already tried several awk commands. For example:

1)

awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1$2$4$5]=$3;next} $1$2$4$5 in a{print $0, a[$1$2$4$5]}' file2 file1 > file3

2)

awk 'NR==FNR {a[$1$2$4$5] = $3; next} $1$2$4$5 in a' file2 file1 >file3

but I'm getting empty output. Could anyone help me with this issue? Thank you.
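A quick way to check whether the key fields really are byte-for-byte identical — hidden carriage returns are a classic cause of empty output from an awk lookup like this — is to bracket the last field so invisible characters show up (a small sketch on a simulated CRLF line):

```shell
# Simulate a file saved with Windows (CRLF) line endings:
printf '1,69511,rs75062661,A,G\r\n' > sample.csv

# A trailing \r sneaks into $5, so on a terminal the closing bracket
# overprints the start of the line instead of following the field.
# (cat -A / cat -v would show the \r as ^M.)
awk -F, '{print "[" $5 "]"}' sample.csv
rm sample.csv
```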

An example of what my files look like

File 1:

CHROM,POS,ID,REF,ALT
1,69270,rs201219564,A,G
1,69428,rs140739101,T,G
1,69496,rs150690004,G,A
1,69511,rs75062661,A,G

File 2:

Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
1,69511,69511,A,C,0.031,0.450,D,0.002,0.090,B,0.001,0.040,B,-1.52,0.370,N,2.8350.402,21.5
1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
1,69511,69511,A,T,1.0,0.010,T,0.0,0.026,B,0.0,0.013,B,2.06,0.004,N,-1.639,0.014,0.002
1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
1,69496,69496,G,C,0.348,0.124,T,0.988,0.604,D,0.865,0.597,P,-2.54,0.550,D,2.6520.380,20.5
1,69496,69496,G,T,0.177,0.222,T,0.999,0.764,D,0.96,0.686,D,-2.69,0.574,D,1.4030.232,12.80
1,69428,69428,T,A,0.878,0.026,T,0.981,0.577,D,0.899,0.620,P,-0.16,0.095,N,0.0240.096,2.821
1,69428,69428,T,C,0.001,0.784,D,0.995,0.657,D,0.969,0.703,D,-5.34,0.846,D,4.5330.607,24.3
1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7

Wanted output:

ID,Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
rs140739101,1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
rs150690004,1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
rs75062661,1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
snps annotation

I'd recommend using Python or R, as writing an awk script for this means you'll spend more time finding the right script than solving the problem.

Once the task is complete, you can explore other ways if you like, but this problem is sophisticated enough to warrant a better tool than awk.

— RamRS

Thank you for your answer. Indeed, I could use R or Python, but the files are huge (> 10 GB) and I'm currently working on a remote computational cluster that I can only access via the command line.

— iya837

Python or Perl (yes, Perl) would work OK. Just look for streaming approaches for File2, storing only File1 in memory.

— RamRS

For starters, you could repeat $2 from File1 so it more closely resembles File2: File1's $1,$2,$2,$4,$5 will be the same as File2's $1,$2,$3,$4,$5. Once that's done, you should be able to use join to bring the content of both files together.
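That suggestion can be sketched like this: prefix each line with a composite Chr:Pos:Ref:Alt key, sort on it, and join the two streams on that single key field. The file names and the '|' key separator below are just for illustration, and it assumes '|' never occurs in the data:

```shell
# Tiny stand-ins for the real files (headers present, skipped with NR>1):
cat > file1.csv <<'EOF'
CHROM,POS,ID,REF,ALT
1,69511,rs75062661,A,G
EOF
cat > file2.csv <<'EOF'
Chr,Start,End,Ref,Alt,SIFT_score
1,69511,69511,A,G,0.652
EOF

# Build "key|payload" records, sort on the key, then join on it:
awk -F, 'NR>1{print $1":"$2":"$4":"$5"|"$3}' file1.csv | sort -t'|' -k1,1 > f1.keyed
awk -F, 'NR>1{print $1":"$2":"$4":"$5"|"$0}' file2.csv | sort -t'|' -k1,1 > f2.keyed

# $2 is the ID carried from file1, $3 is the whole matching file2 line:
join -t'|' f1.keyed f2.keyed | awk -F'|' -v OFS=, '{print $2, $3}'
# -> rs75062661,1,69511,69511,A,G,0.652
```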

— RamRS

I'll give it a try. Thank you

— iya837
Answered 5 weeks ago by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087):

Use Linux join:

join -t ':' -1 1 -2 1  <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file1.csv | sort -t ':' -k1,1 ) <(awk -F ',' '{print $1$2$4$5 ":" $0;}' file2.csv | sort -t ':' -k1,1 ) | cut -d ':' -f 2-

1,69428,rs140739101,T,G:1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
1,69496,rs150690004,G,A:1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
1,69511,rs75062661,A,G:1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047

I leave the final selection of the columns and the header as an exercise.
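One way to finish that exercise (a sketch: the awk splits the file1 half on commas to grab the ID, then prepends it to the file2 half):

```shell
# One joined line as produced above: "file1 fields ':' file2 fields"
printf '1,69428,rs140739101,T,G:1,69428,69428,T,G,0.013,0.538,D\n' |
awk -F':' '{split($1, f, ","); print f[3] "," $2}'
# -> rs140739101,1,69428,69428,T,G,0.013,0.538,D
```

The header line can be prepended separately, e.g. echo "ID,$(head -n1 file2.csv)".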

— Pierre Lindenbaum
Answered 5 weeks ago by iya837:

Thank you all for your answers. I finally solved my problem thanks to @steeldriver: it was due to line endings. For those who may have the same issue, the command is below:

awk -F, 'BEGIN{RS="\r\n" ; OFS=FS} NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next} {ind = $1 FS $2 FS $4 FS $5} ind in a {print $3, a[ind]}' file2 file1

RS="\r\n" : makes awk treat Windows-style CRLF line endings as the record separator, so the stray \r no longer ends up glued to the last field

Separating the fields with FS in the associative-array key just guards against false matches; if you simply concatenate the fields, you can't distinguish, say, "abc" + "def" from "ab" + "cdef", since both give "abcdef".
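Note that a multi-character RS needs GNU awk (POSIX awk only honours the first character of RS). A self-contained check of the command above on a tiny CRLF sample, headers omitted:

```shell
# Two one-line CRLF files, mimicking file1 (CHROM,POS,ID,REF,ALT)
# and file2 (Chr,Start,End,Ref,Alt,...):
printf '1,69511,rs75062661,A,G\r\n'  > file1
printf '1,69511,69511,A,G,0.652\r\n' > file2

# First pass stores file2 lines keyed on Chr,Start,Ref,Alt; second pass
# looks up file1 lines by CHROM,POS,REF,ALT and prints ID + file2 line.
awk -F, 'BEGIN{RS="\r\n"; OFS=FS}
  NR==FNR {a[$1 FS $2 FS $4 FS $5] = $0; next}
  {ind = $1 FS $2 FS $4 FS $5} ind in a {print $3, a[ind]}' file2 file1
# -> rs75062661,1,69511,69511,A,G,0.652
```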

— iya837

Or you could just run dos2unix on the file with the \r\n line endings to convert the file to Unix format with \n line endings.
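If dos2unix isn't available on the cluster, deleting the carriage returns with tr has the same effect (the file names here are just placeholders):

```shell
printf 'CHROM,POS,ID,REF,ALT\r\n' > windows.csv   # CRLF-ended sample
tr -d '\r' < windows.csv > unix.csv               # drop every \r byte
```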

— RamRS
Powered by Biostar version 2.3.0