Question

Separate letters in the same column into different rows

21 months ago

genomes_and_MGEs ▴ 10

Hey everyone,

I have a text file named COGs.txt that looks like this:

COG_category    Element_type    Phylum
LA              Stat            Proteobacteria
E               Stat            Firmicutes
KS              Bact            Proteobacteria
-               Bact            Firmicutes
S               Bact            Firmicutes

My goal is to count how many times each letter occurs in the COG_category column, grouped by Element_type and Phylum. The problem is that some rows have more than one letter in the COG_category column. I know I can use something like

grep -o '[A-Z]' COGs.txt | sort | uniq -c > uniq_counts_COGs.txt

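This outputs the number of occurrences of each letter (as written, it also picks up the capital letters in the header and in the other columns), but it doesn't group the letters by Element_type and Phylum. Maybe datamash would help? The catch is that grouping by COG_category treats a multi-letter value such as LA as a single group instead of splitting it into separate letters.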

Many thanks!

sequence • 388 views


score 3 · Answer 1 · 2022-07-20

21 months ago

Pierre Lindenbaum 161k

grep -v '^COG_category' COGs.txt  | awk '{L=length($1);for(i=1;i<=L;i++) {printf("%s\t%s\t%s\n",substr($1,i,1),$2,$3);} }' | sort | uniq -c
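
For anyone who wants the counts in a single pass, here is a minimal sketch of the same idea (split column 1 into single letters, then count per letter, Element_type and Phylum) done entirely inside awk. It is a variant, not a replacement for the one-liner above. Assumptions that are mine and not from the question: the sample file is rebuilt in a temporary directory as tab-separated text, rows whose category is "-" are skipped rather than counted, and the count is printed as a fourth column instead of in the leading position that uniq -c uses.

```shell
# Hypothetical sample input mirroring the table in the question.
tmpdir=$(mktemp -d)
{
    printf 'COG_category\tElement_type\tPhylum\n'
    printf 'LA\tStat\tProteobacteria\n'
    printf 'E\tStat\tFirmicutes\n'
    printf 'KS\tBact\tProteobacteria\n'
    printf -- '-\tBact\tFirmicutes\n'
    printf 'S\tBact\tFirmicutes\n'
} > "$tmpdir/COGs.txt"

# NR > 1 skips the header line.
# $1 != "-" drops rows without a category (an assumption; remove it to count "-").
counts=$(awk 'NR > 1 && $1 != "-" {
    n = length($1)
    for (i = 1; i <= n; i++) {
        # One key per (letter, Element_type, Phylum) combination.
        key = substr($1, i, 1) "\t" $2 "\t" $3
        count[key]++
    }
}
END {
    # awk does not guarantee array order, so the output is sorted afterwards.
    for (key in count) printf "%s\t%d\n", key, count[key]
}' "$tmpdir/COGs.txt" | LC_ALL=C sort)

printf '%s\n' "$counts"
rm -rf "$tmpdir"
```

To run it on the real file, replace "$tmpdir/COGs.txt" with COGs.txt and drop the lines that build and delete the sample.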
