Question

How to add column to a file from from another file having one common column but different number of rows

0

Entering edit mode

6.5 years ago

AP ▴ 80

Hello everyone,

I have File 1 with 6 columns:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t" pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016

File 2 with 2 columns

**LocusID** "\t" **Expression**

XL_012  "\t"  23

XL_013 "\t"  20

XL_014 "\t"  15

XL_015 "\t"  10

XL_016 "\t"  9

Desired output : I want to merge column expression to file1 based on locus ID:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **Expression** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t"  23 "\t"  pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t"  10 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016 "\t"9

I could do this easily if I had same number of rows for both files but here since I have different number of rows and some repetitions of locus ID due to many pfam domains for single gene I am super confused. Please help me figure out this one using awk, perl or bash. Many thanks in Advance!!!!

Thank you,

Ambika

awk text_processing RNA-Seq • 2.7k views

ADD COMMENT • link updated 6.5 years ago by cpad0112 21k • written 6.5 years ago by AP ▴ 80

0

Entering edit mode

Man! Where does this come from? I am eager to know ;)

ADD REPLY • link 6.5 years ago by lakhujanivijay 5.8k

score 1 · Answer 1 · 2017-11-17

1

Entering edit mode

6.5 years ago

Kevin Blighe 87k

I wasn't sure of the purpose of all of the tab-spaces and quotation marks were.. so I first removed them with sed commands.

sed 's/ \+//g' file1.txt | sed 's/"\\t"/\t/g' | sed 's/"\\t//g' | sed '/^\s*$/d' | sed 's/[\\.]*$//g' | cut -f1-5 > file1v2.txt 
sed 's/ \+//g' file2.txt | sed 's/"\\t"/\t/g' | sed '/^\s*$/d' > file2v2.txt

cat file1v2.txt
**geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
1   aa  XL_012  pfam    P1234
2   bb  XL_013  pfam    P1324
3   cc  XL_014  pfam    P1624
3   cc  XL_014  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
5   ee  XL_016

cat file2v2.txt
**LocusID** **Expression**
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9


awk 'BEGIN {FS="\t"} FNR==NR {key=$3; arrayLookup[key]=$0; next} {key=$1; if (arrayLookup[key]) print $2"\t"arrayLookup[key]}' file1v2.txt file2v2.txt 
**Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
23              1           aa              XL_012      pfam        P1234
20              2           bb              XL_013      pfam        P1324
15              3           cc              XL_014      pfam        P1326
10              4           dd              XL_015      pfam        P1326
9               5           ee              XL_016

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello Kevin, I tried your codes however it is not giving me any output. Does it make any difference if I don't have first heading lines?

ADD REPLY • link 6.5 years ago by AP ▴ 80

0

Entering edit mode

Thank you so much kevin, It worked now but I only problem I have is number of rows for output file, which is less than file 1. I want the same number of rows as file1 because some of pfam domains are unique even though they are from same loci.

Please help and many thanks.

ADD REPLY • link 6.5 years ago by AP ▴ 80

1

Entering edit mode

Oh, I see! Try this (it may require another modification):

awk 'BEGIN {FS="\t"} FNR==NR {key=$1; arrayLookup[key]=$0; next} {key=$3; if (arrayLookup[key]) print arrayLookup[key]"\t"$0}' file2v2.txt file1v2.txt
**LocusID** **Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
XL_012      23              1           aa              XL_012      pfam        P1234
XL_013      20              2           bb              XL_013      pfam        P1324
XL_014      15              3           cc              XL_014      pfam        P1624
XL_014      15              3           cc              XL_014      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_016       9              5           ee              XL_016

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you so much for your help Kevin!!!!

ADD REPLY • link 6.5 years ago by AP ▴ 80

score 0 · Answer 2 · 2017-11-17

0

Entering edit mode

6.5 years ago

venu 7.1k

Not a bioinformatics question. You can try the following

join <(cat file1.txt | awk -F'\t' '{print $2 "\t" $0}' | sort) <(cat file2.txt | sort) | tr ' ' '\t' | cut -f1 --complement

P.S. I didn't test it as your example is very hard to reproduce. I guess, you need to select columns properly with cut command

ADD COMMENT • link 6.5 years ago by venu 7.1k

0

Entering edit mode

I tried but this is not giving me any output and also I want the lines to be in same order.

ADD REPLY • link 6.5 years ago by AP ▴ 80

0

Entering edit mode

post the out put of following commands

head file1.txt file2.txt

ADD REPLY • link 6.5 years ago by venu 7.1k

0

Entering edit mode

head file1.txt:

XL_012 "\t" 23

XL_013 "\t" 20

XL_014 "\t" 15

XL_015 "\t" 10

XL_016 "\t" 9

head file2.txt:

1 "\t" aa "\t" XL_012 "\t" 23 "\t" pfam "\t" P1234 "\t" .............

2 "\t" bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3 "\t" cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t" cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t" dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

ADD REPLY • link 6.5 years ago by AP ▴ 80

0

Entering edit mode

What kind of a format is this? "\t" should mean a tab separated file but your file contains \t' instead oftab spaces`. Please try my answer on a tab separated file. It should work.

ADD REPLY • link 6.5 years ago by venu 7.1k

0

Entering edit mode

@venu It is a tab separated file. I wrote it like that so it would be easy. I will try and keep you updated. Thanks

ADD REPLY • link 6.5 years ago by AP ▴ 80

0

Entering edit mode

You don't need to add anything extra while posting some example data. Just paste few lines as it is and just mention it is a tab / space separated file.

ADD REPLY • link 6.5 years ago by venu 7.1k

0

Entering edit mode

If you are familiar with R, the table joining functions in the dplyr package may be helpful.

ADD REPLY • link 6.5 years ago by mastal511 ★ 2.1k

0

Entering edit mode

@mastal511, I am not an expert but I know little bit of it. Could you suggest what kind of scripts I can use?

ADD REPLY • link 6.5 years ago by AP ▴ 80

0

Entering edit mode

Have a look at this tutorial for the join functions:

http://stat545.com/bit001_dplyr-cheatsheet.html

ADD REPLY • link 6.5 years ago by mastal511 ★ 2.1k

score 0 · Answer 3 · 2017-11-18

output (remove uniq if output requires duplicated lines):

 $ join -t$'\t' -o 1.1,1.2,1.3,2.2,1.4,1.5  --header -1 3 -2 1 <(sort -k3 test1.txt|uniq -u) <(sort -k1 test2.txt) 

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326
3   cc  XL_014  15  pfam    P1624
4   dd  XL_015  10  pfam    P1326

better output:

$ join -t$'\t' -o 1.2,1.3,1.1,2.2,1.4,1.5  --header -1 1 -2 1 <(datamash -g 3 unique 1,2,4,5 < test1.txt) <(sort -k1 test2.txt)

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326,P1624
4   dd  XL_015  10  pfam    P1326

After cleaning input:

$ cat test1.txt 
geneId  scaffold    LocusID pfam    PfamID  
1   aa  XL_012  pfam    P1234   
2   bb  XL_013  pfam    P1324   
3   cc  XL_014  pfam    P1624   
3   cc  XL_014  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326

$ cat test2.txt 
cusID   Expression  
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9