How to merge multiple files with different families and their count numbers for different genomes
2.4 years ago

Hi

I ran a protein domain search on multiple genomes and want to combine the per-genome results into one table, so that I can make a heatmap of the number of predicted domains per family. My results look like this:

Genome  Family  Count
A   GH1 1
A   CE9 2
A   GT2 3
A   CBM2+GH6    9
A   CBM50   4

My second file looks similar:

Genome  Family  Count
B   GH1 5
B   GH51    1
B   AA3 5
B   CBM2+GH6    2
B   GT2 3

and so on for around 150 genomes. I want my output to look like this:

     GH1   GH51   GT2   CE9   AA3   CBM2+GH6   CBM50
A    1     0      3     2     0     9          4
B    5     1      3     0     5     2          0

Can you please let me know the best way to do this?

protein domain • 648 views
2.4 years ago

Have a look at GNU datamash and its crosstab operation: https://www.gnu.org/software/datamash/

Something like this (not tested):

cat input*.tsv | sort -t $'\t' -k1,1 -k2,2 | datamash -s crosstab 1,2 sum 3

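That would fail because of the header lines: the Count column would contain the string "Count", which cannot be summed. Drop the first line of each file first: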

$ sed -s 1d *.txt | datamash crosstab 1,2 sum 3 --filler 0

    AA3 CBM2+GH6    CBM50   CE9 GH1 GH51    GT2
A   0   9   4   2   1   0   3
B   5   2   0   0   5   1   3
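
If datamash is not available, the same pivot can be sketched in plain awk. This is only a minimal sketch, assuming every file is tab-separated and starts with a single header line. The function name `crosstab_awk` is made up for illustration. Unlike the datamash output above, columns and rows appear in first-seen order rather than sorted.

```shell
# Hypothetical helper (not part of any tool): pivot Genome/Family/Count files
# into a genome x family matrix, filling missing combinations with 0.
# Assumes tab-separated input with exactly one header line per file.
crosstab_awk() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { next }                 # skip the header line of every file
        NF < 3   { next }                 # ignore blank or short lines
        {
            # remember row and column labels in the order they first appear
            if (!($1 in row_seen)) { row_seen[$1] = 1; rows[++nrows] = $1 }
            if (!($2 in col_seen)) { col_seen[$2] = 1; cols[++ncols] = $2 }
            count[$1, $2] += $3           # sum counts per (genome, family)
        }
        END {
            # header row: empty corner cell, then one column per family
            line = ""
            for (j = 1; j <= ncols; j++) line = line OFS cols[j]
            print line
            # one row per genome, with 0 where a family was not found
            for (i = 1; i <= nrows; i++) {
                line = rows[i]
                for (j = 1; j <= ncols; j++) {
                    key = rows[i] SUBSEP cols[j]
                    line = line OFS ((key in count) ? count[key] : 0)
                }
                print line
            }
        }
    ' "$@"
}
```

It could then be called as `crosstab_awk *.txt > matrix.tsv`, where `matrix.tsv` is just an example output name.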