How to match 2 columns from 2 different tsv files?
1
0
Entering edit mode
24 months ago
blackadder ▴ 30

Hello!

In case this question is already answered I apologize in advance.

I have 2 very big tsv files with some values of one column from one file matching the values of one column from the other file. Based on that I want to make a new tsv file.

1st file

ERZ871266.fasta_contig1 unclassified (taxid 0)
ERZ871266.fasta_contig2 Aeromicrobium choanae (taxid 1736691)
ERZ871266.fasta_contig4 Clostridioides difficile (taxid 1496)
ERZ871266.fasta_contig6 unclassified (taxid 0)
........

2nd file

/home/results/ERZ940766.fasta  ERZ871266.fasta_contig1 
/home/results/ERZ940766.fasta  ERZ871266.fasta_contig2 
/home/results/ERZ940766.fasta  ERZ871266.fasta_contig3
/home/results/ERZ940766.fasta  ERZ871266.fasta_contig4 
/home/results/ERZ940766.fasta  ERZ871266.fasta_contig5
/home/results/ERZ940766.fasta  ERZ871266.fasta_contig6  
........

What I want to do is the following:

ERZ871266.fasta_contig1 unclassified (taxid 0) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig2 Aeromicrobium choanae (taxid 1736691) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig4 Clostridioides difficile (taxid 1496) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig6 unclassified (taxid 0) /home/results/ERZ940766.fasta 
........

Thanking you in advance!

bash unix pattern-matching • 1.5k views
ADD COMMENT
0
Entering edit mode

What have you tried? A simple search on Stack Overflow will reveal multiple ways of doing this.

ADD REPLY
0
Entering edit mode

Yes, you are right. I forgot to mention what i did.

I played mostly with the join command but I am getting an error saying that my files are not sorted (even though I used the sort command beforehand based on the specific columns). I also don't have duplicates in my files

ADD REPLY
0
Entering edit mode

Show us what you did as well as the exact error you face. join ... <(sort ... file1) <(sort ... file2) with the appropriate params should work.

ADD REPLY
0
Entering edit mode

Initially I tried:

join -t $'\t' -a 1 -a 2 -1 1 -2 2  file1 file2 > output

and I got the following error:

join: file1:11: is not sorted: ERZ871266.fasta_contig10 Serratia marcescens subsp. marcescens Db11 (taxid 273526)
join: file2:11: is not sorted: /home/results/ERZ940766.fasta    ERZ871266.fasta_contig33    
join: input is not in sorted order

Then I tried with the --nocheck-order

join -t $'\t' -a 1 -a 2 -1 1 -2 2  --nocheck-order file1 file2 > output

With that I got an output but the file had missing values. For example:

ERZ871266.fasta_contig1 unclassified (taxid 0) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig2 Aeromicrobium choanae (taxid 1736691) 
ERZ871266.fasta_contig4 Clostridioides difficile (taxid 1496) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig6 unclassified (taxid 0) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig7 unclassified (taxid 0) /home/results/ERZ940766.fasta 
ERZ871266.fasta_contig9 unclassified (taxid 0)

Maybe it has to do with the fact that not all values are matching?

ADD REPLY
0
Entering edit mode

you can use dplyr join function in R (e.g left_join( ))

ADD REPLY
0
Entering edit mode

Why do you recommend left_join when file2 seemingly has more values and if anything, inner_join should be preferred when criteria are unclear?

ADD REPLY
0
Entering edit mode

Edited..depends on how OP wants to join the data and retain columns.

ADD REPLY
3
Entering edit mode
24 months ago
Ram 43k

Either switch to R + merge/dplyr like Mamta says if you want to investigate why certain lines are not matching, or run tests based on known scenarios.

To start off, knock out the -a params and replace file1 and file2 with <(sort -k1,1 file1) and <(sort -k 2,2 file2). --nocheck-order doesn't negate the need for the content to be sorted, it just asks join to skip the check.

Note: Use the 101010 button to format as code. You're using the double quote button that quotes content - it doesn't format content in monospace + highlight keywords (which is what code formatting is).

code_formatting

ADD COMMENT
0
Entering edit mode
join -t $'\t' <(sort -k1,1 file1) <(sort -k2,2 file2) -1 1 -2 2 > output

This worked! Thank you very much !

ADD REPLY
1
Entering edit mode

I'll move it to an answer, please accept it to mark the post as solved.

ADD REPLY

Login before adding your answer.

Traffic: 2292 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6