Match column contents of one file and pasted whole line if match found
2
0
Entering edit mode
6.4 years ago
waqasnayab ▴ 250

Hi,

I have two files.

bigfile1.txt

pos100  5   149275986   149275987   sample1
                                           1   NN
                                           15  NN
                                           18  NN
pos10   1   12907457    12907458    sample2
                                           1   NN
                                           2   NN
                                           3   NN
                                           18  NN
                                           191 NN
pos100  5   149275986   149275987   sample3
                                           17  NN
                                           15  NN
                                           18  NN

lookupfile2.txt

1269554 CA
1372268 AT
1849529 GC
149275986   TT
11854085    AG
11854476    GT
12837371    AA
12907457    TC
12907379    CA
12907457    GA

I want if first content of column 3 (149275986) of bigfile1.txt matches within the lookupfile2.txt, than the whole line is pasted in bigfile1.txt . Like:

pos100  5   149275986   149275987   sample1        149275986    TT
                                           1   NN
                                           15  NN
                                           18  NN
pos10   1   12907457    12907458    sample2        12907457     TC
                                           1   NN
                                           2   NN
                                           3   NN
                                           18  NN
                                           191 NN
pos100  5   149275986   149275987   sample3        149275986    TT
                                           17  NN
                                           15  NN
                                           18  NN

Due to the format I have for bigfile1.txt (it couldn't be sorted), I am getting difficulty in joining the things. I tired: https://stackoverflow.com/questions/18592736/awk-compare-columns-from-two-files-impute-values-of-another-column

and several other but no luck.

Thanks,

Waqas.

SNP genome next-gen sequence • 1.3k views
ADD COMMENT
3
Entering edit mode
6.4 years ago

You can try this - see below. Note that you have a duplicate variant position (12907457). I decided to merge these together first, as you can see.

awk 'BEGIN{FS=" "}; {a[$1]=a[$1]==""? $2 : a[$1] "," $2 } END {for (i in a) print i, a[i]}' lookupfile2.txt > lookupfile2.VarsMerged.txt
cat lookupfile2.VarsMerged.txt
12907379 CA
12907457 TC,GA
12837371 AA
11854476 GT
149275986 TT
1269554 CA
1372268 AT
1849529 GC
11854085 AG

awk 'BEGIN {FS=" "} FNR==NR {values[NR]=$0; next} header=0; /^pos/{start=$3; end=$4; header=1} {for (i in values) {split(values[i],a," "); if (header==1 && start<=a[1] && end>=a[1]) {print $0"\t"values[i]}; if (header==0 && start<=a[1] && end>=a[1]) {print $0}}}' lookupfile2.VarsMerged.txt bigfile1.txt
pos100  5   149275986   149275987   sample1 149275986 TT
                                           1   NN
                                           15  NN
                                           18  NN
pos10   1   12907457    12907458    sample2 12907457 TC,GA
                                           1   NN
                                           2   NN
                                           3   NN
                                           18  NN
                                           191 NN
pos100  5   149275986   149275987   sample3 149275986 TT
                                           17  NN
                                           15  NN
                                           18  NN
ADD COMMENT
0
Entering edit mode

Hi Kevin,

Yeah, nice observation, I found four others positions within my lookup file that are repeated two times:

  2 23103039
  2 23103355
  2 38092026
  2 73717720

Further, I cross-checked with your first command, all fine, and than I run the last command. Seems to me it worked..,,,!!!! Thanks Kevin, once again...,,,!!!!!

ADD REPLY
0
Entering edit mode

Hi Waqas, absolutely no problem.

ADD REPLY
0
Entering edit mode
6.4 years ago

If you are okay with coordinates not being present twice:

$ sed '/pos/ s/\s\+/\t/g' file.txt | csvjoin -t  -c "3,1" --left - ids.txt | sed 's/,/\t/g' | datamash  -g 1 unique 2,3,4,5,6
pos100  5   149275986   149275987   sample1 CA
                                           1   NN                   
                                           15  NN                   
                                           18  NN                   
pos10   1   12907457    12907458    sample2 GA,TC
                                           1   NN                   
                                           2   NN                   
                                           3   NN                   
                                           18  NN                   
                                           191 NN                   
pos100  5   149275986   149275987   sample3 TT
                                           17  NN                   
                                           15  NN                   
                                           18  NN   

$ cat ids.txt 
1269554 CA
1372268 AT
1849529 GC
149275986   TT
11854085    AG
11854476    GT
12837371    AA
12907457    TC
12907379    CA
12907457    GA

$ cat file.txt 
pos100  5   149275986   149275987   sample1
                                           1   NN
                                           15  NN
                                           18  NN
pos10   1   12907457    12907458    sample2
                                           1   NN
                                           2   NN
                                           3   NN
                                           18  NN
                                           191 NN
pos100  5   149275986   149275987   sample3
                                           17  NN
                                           15  NN
                                           18  NN
ADD COMMENT

Login before adding your answer.

Traffic: 1951 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6