Tool: GNU datamash (a command-line program that performs simple calculations)
written 5.7 years ago by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087)

via https://twitter.com/rmounce/status/496198043075891201

GNU datamash - me likey! http://t.co/euWLKGLeEc Seems very much designed with biologists in mind

— Ross Mounce (@rmounce) August 4, 2014


Find the number of isoforms per gene

In genes.txt, the gene identifiers are in column 13 and the transcript identifiers are in column 2. To count how many isoforms each gene has, tell datamash to group by column 13 (-g 13) and count the values in column 2 (count 2). The -s option sorts the input by the grouping column first, so an unsorted file still groups correctly:

$ datamash -s -g 13 count 2 < genes.txt
ABCC1   1
ABCC10  2
ABCC11  3
ABCC12  1
ABCC13  2
...
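
For readers without datamash at hand, the grouping step above can be approximated with awk and sort. This is only a rough sketch: count_per_group is a made-up helper name, the sample rows are invented, and it assumes tab-separated input with no header line.

```shell
# Hypothetical helper (not part of datamash): count_per_group FILE GROUP_COL
# Counts the rows seen for each distinct value of GROUP_COL, roughly what
#   datamash -s -g GROUP_COL count SOME_COL
# reports. Assumes tab-separated input without a header line.
count_per_group() {
    awk -F'\t' -v g="$2" '
        { n[$g]++ }                                # one more row for this group
        END { for (k in n) print k "\t" n[k] }     # group key, then its count
    ' "$1" | LC_ALL=C sort                         # awk loop order is unspecified, so sort
}

# Invented sample: column 1 = transcript, column 2 = gene
sample=$(mktemp)
printf 'NM_1\tGENE_A\nNM_2\tGENE_B\nNM_3\tGENE_A\n' > "$sample"
count_per_group "$sample" 2
rm -f "$sample"
```

On the invented sample this prints GENE_A with a count of 2 and GENE_B with a count of 1. For the file in this post the equivalent call would be count_per_group genes.txt 13.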

Adding the collapse operation also prints a comma-separated list of the isoform identifiers for each gene:

$ datamash -s -g 13 count 2 collapse 2 < genes.txt
ABCC1   1  NM_004996
ABCC10  2  NM_001198934,NM_033450
ABCC11  3  NM_032583,NM_033151,NM_145186
ABCC12  1  NM_033226
ABCC13  2  NR_003087,NR_003088
...
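
What count and collapse do together can be sketched in awk as well. Again this is an approximation under assumptions: collapse_per_group is a hypothetical helper, the input is assumed to be tab-separated with no header, and values are joined in the order they appear in the file.

```shell
# Hypothetical helper (not part of datamash):
#   collapse_per_group FILE GROUP_COL VALUE_COL
# For each distinct value of GROUP_COL, prints the group key, the number of
# rows, and a comma-separated list of the VALUE_COL entries (in input order).
collapse_per_group() {
    awk -F'\t' -v g="$2" -v v="$3" '
        {
            if (n[$g]++) vals[$g] = vals[$g] "," $v   # later rows: append
            else         vals[$g] = $v                # first row of the group
        }
        END { for (k in n) print k "\t" n[k] "\t" vals[k] }
    ' "$1" | LC_ALL=C sort                            # sort lines by group key
}

# Invented sample: column 1 = transcript, column 2 = gene
sample=$(mktemp)
printf 'NM_1\tGENE_A\nNM_2\tGENE_B\nNM_3\tGENE_A\n' > "$sample"
collapse_per_group "$sample" 2 1
rm -f "$sample"
```

For the file in this post the call would be collapse_per_group genes.txt 13 2. The datamash one-liner is shorter and is the better choice when the tool is installed; the sketch is only meant to show what the two operations compute.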

If the file has a header line, add -H. datamash then treats the first input line as column names and prints a header line in the output:

$ datamash -H -s -g 13 count 2 collapse 2 < genes_h.txt
GroupBy(name2)  count(name) collapse(name)
ABCC1           1           NM_004996
ABCC10          2           NM_001198934,NM_033450
ABCC11          3           NM_033151,NM_145186,NM_032583
ABCC12          1           NM_033226
ABCC13          2           NR_003088,NR_003087
...
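
The output header shows that columns 13 and 2 are named name2 and name in genes_h.txt. When the column numbers are not known in advance, one way to look them up is to number the header fields. A small sketch, assuming a tab-separated file whose first line is the header (list_columns is a made-up helper name):

```shell
# Hypothetical helper: list_columns FILE
# Prints each header field of a tab-separated file next to its column number,
# so the right numbers can be passed to -g and to the operations.
list_columns() {
    head -n 1 "$1" | awk -F'\t' '{ for (i = 1; i <= NF; i++) print i "\t" $i }'
}

# Invented three-column sample with a header line
sample=$(mktemp)
printf 'name\tchrom\tname2\nNM_1\tchr1\tGENE_A\n' > "$sample"
list_columns "$sample"
rm -f "$sample"
```

Running list_columns genes_h.txt would list every column of the file used above, one per line.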
Looks quite useful, thanks for sharing the tweet!

written 5.7 years ago by Devon Ryan