Question

How to merge multiple files with different families and their count numbers for different genomes

2.4 years ago

rishi.bhandari63 • 0

Hi

I ran a protein domain search on multiple genomes and want to combine the results into a single table, so that I can make a heatmap based on the number of predicted domains. The results for one genome look like this:

Genome  Family  Count
A   GH1 1
A   CE9 2
A   GT2 3
A   CBM2+GH6    9
A   CBM50   4

Similarly, my second file looks like this:

Genome  Family  Count
B   GH1 5
B   GH51    1
B   AA3 5
B   CBM2+GH6    2
B   GT2 3

and so on for around 150 genomes. I would like the output to look like this:

    GH1  GH51  GT2  CE9  AA3  CBM2+GH6  CBM50
A   1    0     3    2    0    9         4
B   5    1     3    0    5    2         0

Can you please let me know the best way to do this?

protein domain • 648 views

updated 13 months ago by Ram 43k • written 2.4 years ago by rishi.bhandari63 • 0

score 0 · Answer 1 · 2021-12-21

2.4 years ago

Pierre Lindenbaum 161k

Have a look at GNU datamash and its crosstab operation: https://www.gnu.org/software/datamash/

Something like this (not tested):

cat input*.tsv | sort -t $'\t' -k1,1 -k2,2 | datamash -s crosstab 1,2 sum 3

2.4 years ago by Pierre Lindenbaum 161k

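That would fail because of the header lines (strings in the Count column). Dropping the first line of each file and filling the missing cells with 0 works: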

$ sed -s 1d *.txt | datamash crosstab 1,2 sum 3 --filler 0

    AA3 CBM2+GH6    CBM50   CE9 GH1 GH51    GT2
A   0   9   4   2   1   0   3
B   5   2   0   0   5   1   3

2.4 years ago by cpad0112 21k
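
If datamash is not installed, the same pivot can be sketched in plain awk. This is only a minimal sketch, not something from the answers above. It assumes every input file is tab-separated with exactly one header line (Genome, Family, Count), and the function name pivot_counts is made up for illustration. Unlike datamash, the family columns come out in first-seen order rather than sorted, and duplicate Genome/Family pairs are summed.

```shell
# pivot_counts FILE... : print a Genome x Family count matrix (tab-separated).
# Assumption: each file is tab-separated with one header line
# (Genome, Family, Count). The function name is hypothetical.
pivot_counts() {
  awk -F'\t' -v OFS='\t' '
    FNR == 1 { next }   # skip the header line of every input file
    NF < 3   { next }   # ignore blank or malformed lines
    {
      # remember genomes and families in the order they first appear
      if (!($1 in genome_seen)) { genome_seen[$1] = 1; genomes[++ng] = $1 }
      if (!($2 in family_seen)) { family_seen[$2] = 1; families[++nf] = $2 }
      count[$1, $2] += $3   # sum repeated Genome/Family pairs
    }
    END {
      # header row: empty corner cell, then one column per family
      line = ""
      for (j = 1; j <= nf; j++) line = line OFS families[j]
      print line
      # one row per genome; pairs that were never seen are printed as 0
      for (i = 1; i <= ng; i++) {
        line = genomes[i]
        for (j = 1; j <= nf; j++) {
          key = genomes[i] SUBSEP families[j]
          line = line OFS ((key in count) ? count[key] : 0)
        }
        print line
      }
    }
  ' "$@"
}

# Example call (file names are placeholders):
#   pivot_counts *.txt > matrix.tsv
```

The resulting matrix.tsv has genomes as rows and families as columns, which is the layout most heatmap tools expect.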