Question

How to compare data in two files and print matching data of file1 and file2

0

Entering edit mode

4.5 years ago

Kumar ▴ 170

Hi, I have two large files, one is the list of SNPs (file1) and another its annotation file (file2). Please help me to write a code for the following analysis. I am looking to fetch matching data in two files if the data in column first and second are found to match and print the entire row of files (first and second).

I tried the following command but it just prints the data of the second file. However, I want to print matching data of both files (first and second).

awk -F'|' 'NR==FNR{c[$1,$2]++;next};c[$1,$2] > 0' file1.txt file2.txt >out.txt

For example:

File 1:

chr1    9133639 T   CMD
chr2    6134363 C   FFP
chr4    6344639 A   FFP

File 2:

chr1    9133639 T   GI_02334
chr2    6134363 C   GI_02338
chr4    6344639 A   GI_02365
chr1    7133739 A   GI_02339
chr2    5134763 C   GI_02389
chr4    4344639 T   GI_04365

Expected Output:

chr1    9133639 T   CMD chr1    9133639 T   GI_02334
chr2    6134363 C   FFP chr2    6134363 C   GI_02338
chr4    6344639 A   FFP chr4    6344639 A   GI_02365

Genome Match SNPs • 2.2k views

ADD COMMENT • link updated 4.5 years ago by GenoMax 154k • written 4.5 years ago by Kumar ▴ 170

0

Entering edit mode

How large of files are we talking? And does it need to be performed in the shell?
My solution to this would be to import the files into R and combine the data together with a left_join() by chr, position, and nucleotide.

Otherwise you might be able to do what you want with the join command, but it might take some extra work since join requires that you only join by one field and that the files are sorted by the key column

ADD REPLY • link 4.5 years ago by KH ▴ 100

0

Entering edit mode

Thank you for your reply. A total number 1500 of records in SNPs (file 1) and 593337 records are in annotation file (file2). It is not mandatory to be performed in shell script. However, if there is a shell or python/perl script, it would be best. Could you please elaborate the left_join() to use in R.

ADD REPLY • link 4.5 years ago by Kumar ▴ 170

score 0 · Answer 1 · 2021-05-17

0

Entering edit mode

4.5 years ago

Alex Reynolds 36k

Via Unix join:

% join -t $'\t' -11 -21 -o 1.2,1.3,1.4,2.2,2.3,2.4,2.5 <(awk -v FS="\t" -v OFS="\t" '{ print $1"_"$2, $0 }' A.txt | sort -k1) <(awk -v FS="\t" -v OFS="\t" '{ print $1"_"$2, $0 }' B.txt | sort -k1)          
chr1    9133639 T   chr1    9133639 T   GI_02334
chr2    6134363 C   chr2    6134363 C   GI_02338
chr4    6344639 A   chr4    6344639 A   GI_02365

This works by creating a unique identifier key from the chromosome and position of each file with awk, sorting on that key, and joining the two files on that key (-11 -21). The -o 1.2,1.3,1.4,2.2,2.3,2.4,2.5 option prints out non-key columns.

Using sorted data will add runtime, but it may be preferable to reading both files into associative arrays in awk (or dictionaries in Python, hash tables in Perl, etc.) as the latter approach can use a lot of memory, which could be troublesome if your second file is very large.

Here's a dictionary approach in Python, using a tuple of the first and second column values as a key:

#!/usr/bin/env python

import sys

f1_fn = "A.txt"
f2_fn = "B.txt"

f1 = {}
with open(f1_fn, 'r') as f1_fh:
    for l in f1_fh:
        e = l.rstrip().split('\t')
        f1[(e[0], e[1])] = e

f2 = {}
with open(f2_fn, 'r') as f2_fh:
    for l in f2_fh:
        e = l.rstrip().split('\t')
        f2[(e[0], e[1])] = e

# inner join
k1 = f1.keys()
k2 = f2.keys()
for k in k1:
    if k in k2:
        sys.stdout.write('{}\n'.format('\t'.join(f1[k] + f2[k])))

ADD COMMENT • link 4.5 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you for your help. Yes, it works. However, it prints three columns from file1 and four columns from file2. Can I increase the columns in print such as print four columns of file1 and six columns of file2.

ADD REPLY • link 4.5 years ago by Kumar ▴ 170

0

Entering edit mode

Sure, just edit -o to add the additional column indices you want from files one and two.

ADD REPLY • link 4.5 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you for the help!

ADD REPLY • link 4.5 years ago by Kumar ▴ 170

0

Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
upvote_bookmark_accept