Using AWK to perform VLOOKUP-like command
3
3
Entering edit mode
9.2 years ago
sagi.polani ▴ 100

I have a set of files with genotypic data, each divided into 3 columns of data, including: MARKER, ID, GENOTYPE.

I would like to use AWK (without changing the order/sorting of the files) to perform a VLOOKUP-like command in order to join the data within the files into a single file as follows:

File1:

BIEC2-99962 HOR_233 G_G
BIEC2-9997 HOR_233 A_G
BIEC2-999748 HOR_233 C_C
BIEC2-999848 HOR_233 G_G
BIEC2-99989 HOR_233 A_A

File2:

BIEC2-9997 HOR_250 A_A
BIEC2-999748 HOR_250 C_C
BIEC2-99989 HOR_250 A_C

File3:

BIEC2-9997 HOR_615 A_G
BIEC2-999748 HOR_615 A_C
BIEC2-999848 HOR_615 A_G
BIEC2-99989 HOR_615 A_C

Expected result:

BIEC2-99962 G_G NA NA   
BIEC2-9997 A_G A_A A_G
BIEC2-999748 C_C C_C A_C
BIEC2-999848 G_G NA A_G
BIEC2-99989 A_A A_C A_C

I would appreciate any help on this.

Thanks!

next-gen • 11k views
ADD COMMENT
2
Entering edit mode

You want to do it ONLY in awk or any script would be ok ?

ADD REPLY
1
Entering edit mode

Hello sagi.polani!

It appears that your post has been cross-posted to another site: SEQanswers: http://seqanswers.com/forums/showthread.php?p=160574#post160574

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

While you can process multiple input files at once in awk, it's not something that comes highly recommended. I would really recommend writing a short python or perl script. That's inevitably easier to debug.

ADD REPLY
0
Entering edit mode

Hi Ryan,

Preferably awk...

Thanks!

ADD REPLY
0
Entering edit mode

What have you tried? The awk method for doing this is terribly inefficient and overly complicated. If this isn't something that you absolutely have to do, then take the hint and don't use awk for the task.

ADD REPLY
0
Entering edit mode

I prefer one-liners, but I'm open to suggestions.

Thanks

ADD REPLY
1
Entering edit mode

A one-liner solution would use join rather than awk.

ADD REPLY
0
Entering edit mode

Yes, but join requires me to sort the files, which I want to avoid doing.

ADD REPLY
0
Entering edit mode

Ah, I had assumed that you'd already done that. What have you tried so far with awk and what isn't yet working?

ADD REPLY
0
Entering edit mode

I went through numerous threads that I found, but non of them really worked. I'm not a real pro at this...

ADD REPLY
1
Entering edit mode

Define "not worked". It's unlikely that any of us will want to bother writing an awk-based solution for this. The most help you're likely to get is advise on a script that you've already started writing but isn't quite working.

ADD REPLY
2
Entering edit mode
9.2 years ago

I strongly recommend you against using awk for complex join operations. I would be able to tell you a million stories, about times when I trusted the output of a awk command, only to discover that it didn't run correctly because of some minor mistake or some exception.

This is how I would do it using R and the dplyr package:

> library(dplyr)
> bio1 = read.table('bio1')
> bio2 = read.table('bio2')
> bio3 = read.table('bio3')
> bio1 %>%
    left_join(
        bio2 %>% # join with the second file (only the first and third column)
            select(V1, V3), 
        by='V1') %>%     
    left_join(
        bio3 %>% 
            select(V1, V3), 
        by='V1') %>%
    mutate_each(
       funs(ifelse(is.na(.), 'NA', as.character(.))), starts_with('V3'))   # transform all NAs to the string NA
ADD COMMENT
0
Entering edit mode

Thanks Giovanni - it worked great!

ADD REPLY
2
Entering edit mode
9.2 years ago

Here is a Awk solution:

awk '{if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}' file*

It stores the content of the different files in memory, building an array of arrays and using the file number (f) as column index. So beware if you are using large files.

ADD COMMENT
0
Entering edit mode

Hello, I am getting an odd error:

awk '{if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}' GAGA_* > awk_result

awk: {if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                           ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                           ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                                ^ syntax error
awk: {if (FNR == 1) f++ ; a[$1][f] = $3} END {for (i in a) {printf I ; for (j = 1 ; j <= f ; j++) {printf (a[i][j]) ? "\t"a[i][j] : "\tNA"} ; printf "\n"}}
awk:                                                                                                                          ^ syntax error
ADD REPLY
0
Entering edit mode

You must be using an old version of Awk (are you using a Mac?). Array of arrays is a feature that was introduced 4 years ago in GNU Awk 4.0.

ADD REPLY
0
Entering edit mode
@BioPower3-IBM ~ $ awk -W version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.

Looks like you are right. I'Il try to update.

ADD REPLY
1
Entering edit mode
9.2 years ago

You could probably modify this script for your input sets:

​#!/usr/bin/env python

input_one = '''
BIEC2-99962 HOR_233 G_G
BIEC2-9997 HOR_233 A_G
BIEC2-999748 HOR_233 C_C
BIEC2-999848 HOR_233 G_G
BIEC2-99989 HOR_233 A_A
'''.strip().split()

input_two = '''
BIEC2-9997 HOR_250 A_A
BIEC2-999748 HOR_250 C_C
BIEC2-99989 HOR_250 A_C
'''.strip().split()

input_three = '''
BIEC2-9997 HOR_615 A_G
BIEC2-999748 HOR_615 A_C
BIEC2-999848 HOR_615 A_G
BIEC2-99989 HOR_615 A_C
'''.strip().split()

one_list = input_one[::3]
one_dict = dict(zip(one_list, input_one[2::3]))
two_dict = dict(zip(input_two[::3], input_two[2::3]))
three_dict = dict(zip(input_three[::3], input_three[2::3]))

print '\n'.join([' '.join([k, one_dict[k], two_dict.get(k, 'NA'), three_dict.get(k, 'NA')]) for k in one_list])

The output looks like:

$ ./join_test.py
BIEC2-99962 G_G NA NA
BIEC2-9997 A_G A_A A_G
BIEC2-999748 C_C C_C A_C
BIEC2-999848 G_G NA A_G
BIEC2-99989 A_A A_C A_C

This seems to match your expected output.

If you want to understand how the script works, use some print statements for each variable before the list comprehension, and then break the list comprehension down into smaller pieces.

Ultimately, you would replace lists input_one, input_two and input_three with the results from reading in your input files with open() and readlines() methods.

Remember to strip() and split() so that each element of the list is separated from the others, regardless of whether the delimiter is a space or newline - use print to investigate one of the sample input lists, if this requirement isn't clear.

I'd second that awk is not really the ideal tool for this job, and I use it a great deal.

ADD COMMENT
0
Entering edit mode

Hello,

I am trying to replace input_one and input_two with opening of file, here is the error I get:

Traceback (most recent call last):
  File "test2.py", line 6, in <module>
    input_one = inputo.strip().split()
AttributeError: 'file' object has no attribute 'strip'

And this is the script I tried:

#!/usr/bin/env python

import sys

inputo = open(sys.argv[1], 'r')
input_one = inputo.strip().split()
input_two = open(sys.argv[2], 'r').strip().split()

one_list = input_one[::3]
one_dict = dict(zip(one_list, input_one[2::3]))
two_dict = dict(zip(input_two[::3], input_two[2::3]))
three_dict = dict(zip(input_three[::3], input_three[2::3]))

print '\n'.join([' '.join([k, one_dict[k], two_dict.get(k, 'NA'), three_dict.get(k, 'NA')]) for k in one_list])

Must be something wrong with splitting of course, Thank you for your help!

ADD REPLY
0
Entering edit mode

Read up on Python I/O and examples of how that works. You don't want to split on a file handle (and as the error shows, you can't). Hint: You need to read the entire file into a string before splitting, using the read() method on a file handle.

ADD REPLY

Login before adding your answer.

Traffic: 1972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6