Hi, I have two large files: a list of SNPs (file1) and its annotation file (file2). Please help me write code for the following analysis. I want to find the rows whose first and second columns match in both files, and print the entire matching row from each file (file1 followed by file2).
I tried the following command, but it prints only the rows from the second file. I want to print the matching rows from both files.
awk -F'|' 'NR==FNR{c[$1,$2]++;next};c[$1,$2] > 0' file1.txt file2.txt >out.txt
For example:
File 1:
chr1    9133639 T   CMD
chr2    6134363 C   FFP
chr4    6344639 A   FFP
File 2:
chr1    9133639 T   GI_02334
chr2    6134363 C   GI_02338
chr4    6344639 A   GI_02365
chr1    7133739 A   GI_02339
chr2    5134763 C   GI_02389
chr4    4344639 T   GI_04365
Expected Output:
chr1    9133639 T   CMD chr1    9133639 T   GI_02334
chr2    6134363 C   FFP chr2    6134363 C   GI_02338
chr4    6344639 A   FFP chr4    6344639 A   GI_02365
How large are the files? And does this need to be done in the shell?
My solution would be to import the files into R and combine the data with a left_join() by chr, position, and nucleotide. Otherwise, you might be able to do what you want with the join command, but it might take some extra work, since join matches on only one field and requires both files to be sorted on the key column.

Thank you for your reply. There are 1500 records in the SNP file (file1) and 593337 records in the annotation file (file2). It does not have to be a shell script; however, a shell or Python/Perl script would be best. Could you please elaborate on how to use left_join() in R?
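
Since a Python script was mentioned as acceptable, here is a minimal sketch of the join. It assumes whitespace-separated columns, as in the example above, and takes the two file names on the command line. The function name join_on_key, its n_key parameter, and the script name join_snps.py are made up for illustration. It loads the small SNP file into a dictionary keyed on the first two columns, then streams the annotation file and emits both rows whenever the key matches.

```python
import sys
from collections import defaultdict


def join_on_key(snp_lines, annot_lines, n_key=2):
    """Yield 'file1 row<TAB>file2 row' for rows whose first n_key columns match."""
    # Index the small file (file1) by its key columns. Every row is kept
    # per key in case a key appears more than once.
    index = defaultdict(list)
    for line in snp_lines:
        fields = line.split()
        if len(fields) >= n_key:
            index[tuple(fields[:n_key])].append(fields)

    # Stream the large file (file2) and emit one joined row per match,
    # so file2 is never held in memory as a whole.
    for line in annot_lines:
        fields = line.split()
        if len(fields) < n_key:
            continue  # skip blank or truncated lines
        for snp_fields in index.get(tuple(fields[:n_key]), []):
            yield "\t".join(snp_fields + fields)


# Assumed usage: python join_snps.py file1.txt file2.txt > out.txt
# (the script does nothing unless exactly two file names are given)
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for row in join_on_key(f1, f2):
            print(row)
```

Passing n_key=3 would also require the nucleotide column to match, which corresponds to joining by chr, position, and nucleotide as suggested above. Note that this sketch keeps only matching rows (an inner join), which is what the expected output shows; a true left join would also keep the file1 rows that have no match in file2. Output rows come out in file2 order, with columns separated by tabs.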