Using AWK to perform VLOOKUP-like command
3
3
Entering edit mode
8.1 years ago
sagi.polani ▴ 100

I have a set of files with genotypic data, each divided into 3 columns of data, including: MARKER, ID, GENOTYPE.

I would like to use AWK (without changing the order/sorting of the files) to perform a VLOOKUP-like command in order to join the data within the files into a single file as follows:

File1:

BIEC2-99962 HOR_233 G_G
BIEC2-9997 HOR_233 A_G
BIEC2-999748 HOR_233 C_C
BIEC2-999848 HOR_233 G_G
BIEC2-99989 HOR_233 A_A

File2:

BIEC2-9997 HOR_250 A_A
BIEC2-999748 HOR_250 C_C
BIEC2-99989 HOR_250 A_C

File3:

BIEC2-9997 HOR_615 A_G
BIEC2-999748 HOR_615 A_C
BIEC2-999848 HOR_615 A_G
BIEC2-99989 HOR_615 A_C

Expected result:

BIEC2-99962 G_G NA NA
BIEC2-9997 A_G A_A A_G
BIEC2-999748 C_C C_C A_C
BIEC2-999848 G_G NA A_G
BIEC2-99989 A_A A_C A_C

I would appreciate any help on this.

Thanks!

next-gen • 10k views
2
Entering edit mode

You want to do it ONLY in awk or any script would be ok ?

1
Entering edit mode

Hello sagi.polani!

This is typically not recommended as it runs the risk of annoying people in both communities.

1
Entering edit mode
0
Entering edit mode

While you can process multiple input files at once in awk, it's not something that comes highly recommended. I would really recommend writing a short python or perl script. That's inevitably easier to debug.

0
Entering edit mode

Hi Ryan,

Preferably awk...

Thanks!

0
Entering edit mode

What have you tried? The awk method for doing this is terribly inefficient and overly complicated. If this isn't something that you absolutely have to do, then take the hint and don't use awk for the task.

0
Entering edit mode

I prefer one-liners, but I'm open to suggestions.

Thanks

1
Entering edit mode

A one-liner solution would use join rather than awk.

0
Entering edit mode

Yes, but join requires me to sort the files, which I want to avoid doing.

0
Entering edit mode

Ah, I had assumed that you'd already done that. What have you tried so far with awk and what isn't yet working?

0
Entering edit mode

I went through numerous threads that I found, but non of them really worked. I'm not a real pro at this...

1
Entering edit mode

Define "not worked". It's unlikely that any of us will want to bother writing an awk-based solution for this. The most help you're likely to get is advise on a script that you've already started writing but isn't quite working.

2
Entering edit mode
8.1 years ago

I strongly recommend you against using awk for complex join operations. I would be able to tell you a million stories, about times when I trusted the output of a awk command, only to discover that it didn't run correctly because of some minor mistake or some exception.

This is how I would do it using R and the dplyr package:

> library(dplyr)
> bio1 %>%
left_join(
bio2 %>% # join with the second file (only the first and third column)
select(V1, V3),
by='V1') %>%
left_join(
bio3 %>%
select(V1, V3),
by='V1') %>%
mutate_each(
funs(ifelse(is.na(.), 'NA', as.character(.))), starts_with('V3'))   # transform all NAs to the string NA

0
Entering edit mode

Thanks Giovanni - it worked great!

2
Entering edit mode
8.1 years ago

Here is a Awk solution:

awk '{if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}' file*


It stores the content of the different files in memory, building an array of arrays and using the file number (f) as column index. So beware if you are using large files.

0
Entering edit mode

Hello, I am getting an odd error:

awk '{if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}' GAGA_* > awk_result

awk: {if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                           ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                           ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                                ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] =$3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                                          ^ syntax error

0
Entering edit mode

You must be using an old version of Awk (are you using a Mac?). Array of arrays is a feature that was introduced 4 years ago in GNU Awk 4.0.

0
Entering edit mode
@BioPower3-IBM ~ $awk -W version GNU Awk 3.1.7 Copyright (C) 1989, 1991-2009 Free Software Foundation.  Looks like you are right. I'Il try to update. ADD REPLY 1 Entering edit mode 8.1 years ago You could probably modify this script for your input sets: ​#!/usr/bin/env python input_one = ''' BIEC2-99962 HOR_233 G_G BIEC2-9997 HOR_233 A_G BIEC2-999748 HOR_233 C_C BIEC2-999848 HOR_233 G_G BIEC2-99989 HOR_233 A_A '''.strip().split() input_two = ''' BIEC2-9997 HOR_250 A_A BIEC2-999748 HOR_250 C_C BIEC2-99989 HOR_250 A_C '''.strip().split() input_three = ''' BIEC2-9997 HOR_615 A_G BIEC2-999748 HOR_615 A_C BIEC2-999848 HOR_615 A_G BIEC2-99989 HOR_615 A_C '''.strip().split() one_list = input_one[::3] one_dict = dict(zip(one_list, input_one[2::3])) two_dict = dict(zip(input_two[::3], input_two[2::3])) three_dict = dict(zip(input_three[::3], input_three[2::3])) print '\n'.join([' '.join([k, one_dict[k], two_dict.get(k, 'NA'), three_dict.get(k, 'NA')]) for k in one_list])  The output looks like: $ ./join_test.py
BIEC2-99962 G_G NA NA
BIEC2-9997 A_G A_A A_G
BIEC2-999748 C_C C_C A_C
BIEC2-999848 G_G NA A_G
BIEC2-99989 A_A A_C A_C


This seems to match your expected output.

If you want to understand how the script works, use some print statements for each variable before the list comprehension, and then break the list comprehension down into smaller pieces.

Ultimately, you would replace lists input_one, input_two and input_three with the results from reading in your input files with open() and readlines() methods.

Remember to strip() and split() so that each element of the list is separated from the others, regardless of whether the delimiter is a space or newline - use print to investigate one of the sample input lists, if this requirement isn't clear.

I'd second that awk is not really the ideal tool for this job, and I use it a great deal.

0
Entering edit mode

Hello,

I am trying to replace input_one and input_two with opening of file, here is the error I get:

Traceback (most recent call last):
File "test2.py", line 6, in <module>
input_one = inputo.strip().split()
AttributeError: 'file' object has no attribute 'strip'


And this is the script I tried:

#!/usr/bin/env python

import sys

inputo = open(sys.argv[1], 'r')
input_one = inputo.strip().split()
input_two = open(sys.argv[2], 'r').strip().split()

one_list = input_one[::3]
one_dict = dict(zip(one_list, input_one[2::3]))
two_dict = dict(zip(input_two[::3], input_two[2::3]))
three_dict = dict(zip(input_three[::3], input_three[2::3]))

print '\n'.join([' '.join([k, one_dict[k], two_dict.get(k, 'NA'), three_dict.get(k, 'NA')]) for k in one_list])


Must be something wrong with splitting of course, Thank you for your help!

0
Entering edit mode

Read up on Python I/O and examples of how that works. You don't want to split on a file handle (and as the error shows, you can't). Hint: You need to read the entire file into a string before splitting, using the read() method on a file handle.