Question

Tool: GNU datamash (a command-line program that performs simple calculations)

9.8 years ago

Pierre Lindenbaum 161k

via

GNU datamash - me likey! http://t.co/euWLKGLeEc Seems very much designed with biologists in mind
— R⓪ss Mounce (@rmounce) August 4, 2014

Find the number of isoforms per gene

The gene identifiers are in column 13 and the transcript identifiers are in column 2. To count how many isoforms each gene has, use datamash to group by column 13 and, for each group, count the values in column 2 (-s sorts the input by the group column first, which grouping requires):

$ datamash -s -g 13 count 2 < genes.txt
ABCC1   1
ABCC10  2
ABCC11  3
ABCC12  1
ABCC13  2
...
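
As a cross-check on a machine without datamash, the same per-gene count can be approximated with standard tools. This is only a sketch on a made-up file: demo_count.txt and its two-column layout are hypothetical, not the genes.txt used above.

```shell
#!/bin/sh
# Toy input (hypothetical layout): transcript ID in column 1, gene in column 2.
# The real genes.txt above keeps them in columns 2 and 13 instead.
printf 'NM_033450\tABCC10\nNM_004996\tABCC1\nNM_001198934\tABCC10\n' > demo_count.txt

# Pull out the gene column, sort it, and count each distinct value.
# awk then reorders uniq's "count gene" lines into "gene<TAB>count".
counts=$(cut -f 2 demo_count.txt | LC_ALL=C sort | uniq -c | awk '{ print $2 "\t" $1 }')
printf '%s\n' "$counts"
```

On this toy file, ABCC1 should get a count of 1 and ABCC10 a count of 2, matching what the datamash grouping reports for those two genes.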

Using the collapse operation, datamash can print all the isoforms for each gene:

$ datamash -s -g 13 count 2 collapse 2 < genes.txt
ABCC1   1  NM_004996
ABCC10  2  NM_001198934,NM_033450
ABCC11  3  NM_032583,NM_033151,NM_145186
ABCC12  1  NM_033226
ABCC13  2  NR_003087,NR_003088
...
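
To make the grouping semantics concrete, here is a rough awk equivalent of count plus collapse, run on a tiny made-up table. The file name demo.txt and its two-column layout are hypothetical, and sorting the transcripts within each gene is a choice made here for a stable result, not something datamash is claimed to do.

```shell
#!/bin/sh
# Toy input (hypothetical layout): transcript ID in column 1, gene in column 2.
printf 'NM_033450\tABCC10\nNM_004996\tABCC1\nNM_001198934\tABCC10\n' > demo.txt

# Sort by gene, then by transcript, so each gene's rows are adjacent.
# awk walks the sorted rows, keeping a running count and a comma-joined
# list per gene, and prints one line whenever the gene changes.
result=$(LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k1,1 demo.txt |
  awk -F '\t' '
    $2 != gene { if (NR > 1) print gene "\t" n "\t" list; gene = $2; n = 0; list = "" }
    { n++; list = (list == "" ? $1 : list "," $1) }
    END { if (NR > 0) print gene "\t" n "\t" list }
  ')
printf '%s\n' "$result"
```

The one-liner with datamash is far shorter, which is much of the tool's appeal: the group, count, and collapse steps are each a single word on the command line.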

When the file has a header line, add -H so that datamash treats the first line as column names and prints a header in the output:

$ datamash -H -s -g 13 count 2 collapse 2 < genes_h.txt
GroupBy(name2)  count(name) collapse(name)
ABCC1           1           NM_004996
ABCC10          2           NM_001198934,NM_033450
ABCC11          3           NM_033151,NM_145186,NM_032583
ABCC12          1           NM_033226
ABCC13          2           NR_003088,NR_003087
...

linux cmdline utility • 2.4k views

updated 2.4 years ago by Ram 43k • written 9.8 years ago by Pierre Lindenbaum 161k

Looks quite useful, thanks for sharing the tweet!

9.8 years ago by Devon Ryan 104k