Question: How to add column to a file from from another file having one common column but different number of rows
0
gravatar for Ambika
23 days ago by
Ambika20
United States/Auburn/Auburn University
Ambika20 wrote:

Hello everyone,

I have File 1 with 6 columns:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t" pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016

File 2 with 2 columns

**LocusID** "\t" **Expression**

XL_012  "\t"  23

XL_013 "\t"  20

XL_014 "\t"  15

XL_015 "\t"  10

XL_016 "\t"  9

Desired output : I want to merge column expression to file1 based on locus ID:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **Expression** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t"  23 "\t"  pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t"  10 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016 "\t"9

I could do this easily if I had same number of rows for both files but here since I have different number of rows and some repetitions of locus ID due to many pfam domains for single gene I am super confused. Please help me figure out this one using awk, perl or bash. Many thanks in Advance!!!!

Thank you,

Ambika

awk rna-seq text_processing • 251 views
ADD COMMENTlink modified 23 days ago by cpad01123.6k • written 23 days ago by Ambika20

Man! Where does this come from? I am eager to know ;)

ADD REPLYlink written 23 days ago by Vijay Lakhujani1.4k
1
gravatar for Kevin Blighe
23 days ago by
Kevin Blighe9.0k
Europe/Americas
Kevin Blighe9.0k wrote:

I wasn't sure of the purpose of all of the tab-spaces and quotation marks were.. so I first removed them with sed commands.

sed 's/ \+//g' file1.txt | sed 's/"\\t"/\t/g' | sed 's/"\\t//g' | sed '/^\s*$/d' | sed 's/[\\.]*$//g' | cut -f1-5 > file1v2.txt 
sed 's/ \+//g' file2.txt | sed 's/"\\t"/\t/g' | sed '/^\s*$/d' > file2v2.txt

cat file1v2.txt
**geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
1   aa  XL_012  pfam    P1234
2   bb  XL_013  pfam    P1324
3   cc  XL_014  pfam    P1624
3   cc  XL_014  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
5   ee  XL_016

cat file2v2.txt
**LocusID** **Expression**
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9


awk 'BEGIN {FS="\t"} FNR==NR {key=$3; arrayLookup[key]=$0; next} {key=$1; if (arrayLookup[key]) print $2"\t"arrayLookup[key]}' file1v2.txt file2v2.txt 
**Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
23              1           aa              XL_012      pfam        P1234
20              2           bb              XL_013      pfam        P1324
15              3           cc              XL_014      pfam        P1326
10              4           dd              XL_015      pfam        P1326
9               5           ee              XL_016
ADD COMMENTlink written 23 days ago by Kevin Blighe9.0k

Hello Kevin, I tried your codes however it is not giving me any output. Does it make any difference if I don't have first heading lines?

ADD REPLYlink written 22 days ago by Ambika20

Thank you so much kevin, It worked now but I only problem I have is number of rows for output file, which is less than file 1. I want the same number of rows as file1 because some of pfam domains are unique even though they are from same loci.

Please help and many thanks.

ADD REPLYlink modified 22 days ago • written 22 days ago by Ambika20
1

Oh, I see! Try this (it may require another modification):

awk 'BEGIN {FS="\t"} FNR==NR {key=$1; arrayLookup[key]=$0; next} {key=$3; if (arrayLookup[key]) print arrayLookup[key]"\t"$0}' file2v2.txt file1v2.txt
**LocusID** **Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
XL_012      23              1           aa              XL_012      pfam        P1234
XL_013      20              2           bb              XL_013      pfam        P1324
XL_014      15              3           cc              XL_014      pfam        P1624
XL_014      15              3           cc              XL_014      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_016       9              5           ee              XL_016
ADD REPLYlink written 22 days ago by Kevin Blighe9.0k
1

Thank you so much for your help Kevin!!!!

ADD REPLYlink written 22 days ago by Ambika20
0
gravatar for venu
23 days ago by
venu4.5k
Germany
venu4.5k wrote:

Not a bioinformatics question. You can try the following

join <(cat file1.txt | awk -F'\t' '{print $2 "\t" $0}' | sort) <(cat file2.txt | sort) | tr ' ' '\t' | cut -f1 --complement

P.S. I didn't test it as your example is very hard to reproduce. I guess, you need to select columns properly with cut command

ADD COMMENTlink written 23 days ago by venu4.5k

I tried but this is not giving me any output and also I want the lines to be in same order.

ADD REPLYlink written 23 days ago by Ambika20

post the out put of following commands

head file1.txt file2.txt
ADD REPLYlink written 23 days ago by venu4.5k

head file1.txt:

XL_012 "\t" 23

XL_013 "\t" 20

XL_014 "\t" 15

XL_015 "\t" 10

XL_016 "\t" 9

head file2.txt:

1 "\t" aa "\t" XL_012 "\t" 23 "\t" pfam "\t" P1234 "\t" .............

2 "\t" bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3 "\t" cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t" cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t" dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4"\t" dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

ADD REPLYlink written 23 days ago by Ambika20

What kind of a format is this? "\t" should mean a tab separated file but your file contains \t' instead oftab spaces`. Please try my answer on a tab separated file. It should work.

ADD REPLYlink written 23 days ago by venu4.5k

@venu It is a tab separated file. I wrote it like that so it would be easy. I will try and keep you updated. Thanks

ADD REPLYlink written 23 days ago by Ambika20

You don't need to add anything extra while posting some example data. Just paste few lines as it is and just mention it is a tab / space separated file.

ADD REPLYlink written 23 days ago by venu4.5k

If you are familiar with R, the table joining functions in the dplyr package may be helpful.

ADD REPLYlink written 23 days ago by mastal5111.7k

@mastal511, I am not an expert but I know little bit of it. Could you suggest what kind of scripts I can use?

ADD REPLYlink written 23 days ago by Ambika20

Have a look at this tutorial for the join functions:

http://stat545.com/bit001_dplyr-cheatsheet.html

ADD REPLYlink written 23 days ago by mastal5111.7k
0
gravatar for cpad0112
23 days ago by
cpad01123.6k
cpad01123.6k wrote:

output (remove uniq if output requires duplicated lines):

 $ join -t$'\t' -o 1.1,1.2,1.3,2.2,1.4,1.5  --header -1 3 -2 1 <(sort -k3 test1.txt|uniq -u) <(sort -k1 test2.txt) 

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326
3   cc  XL_014  15  pfam    P1624
4   dd  XL_015  10  pfam    P1326

better output:

$ join -t$'\t' -o 1.2,1.3,1.1,2.2,1.4,1.5  --header -1 1 -2 1 <(datamash -g 3 unique 1,2,4,5 < test1.txt) <(sort -k1 test2.txt)

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326,P1624
4   dd  XL_015  10  pfam    P1326

After cleaning input:

$ cat test1.txt 
geneId  scaffold    LocusID pfam    PfamID  
1   aa  XL_012  pfam    P1234   
2   bb  XL_013  pfam    P1324   
3   cc  XL_014  pfam    P1624   
3   cc  XL_014  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326

$ cat test2.txt 
cusID   Expression  
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9
ADD COMMENTlink modified 23 days ago • written 23 days ago by cpad01123.6k

Thank you for your suggestions but I don't know why it is not recognizing datamash in my system. It says it is not found.

ADD REPLYlink written 22 days ago by Ambika20

If you are using Ubuntu or mint or Debian, sudo apt-get install datamash -y should install the datamash package.

ADD REPLYlink modified 22 days ago • written 22 days ago by cpad01123.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1491 users visited in the last hour