Tool: GNU datamash (a command-line program that performs simple calculations)
9.8 years ago

Find Number of isoforms per gene

The gene identifiers are in column 13, the transcript identifiers are in column 2. To count how many isoforms each gene has, use datamash to group by column 13, and for each group, count the values in column 2 (use -s to automatically sort the input file):

$ datamash -s -g 13 count 2 < genes.txt
ABCC1   1
ABCC10  2
ABCC11  3
ABCC12  1
ABCC13  2
...
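
To cross-check the counts on a machine without datamash, the same grouping can be approximated with awk. This is only a sketch: the sample file sample_genes.txt, the helper make_row, and the function count_isoforms are made up for illustration, and the input is assumed to be tab-separated with the gene name in column 13 and the transcript ID in column 2.

```shell
# Build a tiny tab-separated sample (hypothetical data): 13 columns,
# transcript ID in column 2, gene name in column 13, "." as filler.
make_row() {
  printf '.\t%s\t.\t.\t.\t.\t.\t.\t.\t.\t.\t.\t%s\n' "$1" "$2"
}
{
  make_row NM_033450 ABCC10
  make_row NM_004996 ABCC1
  make_row NM_001198934 ABCC10
} > sample_genes.txt

# Count how many rows (isoforms) fall into each column-13 group,
# then sort the result by gene name in byte order.
count_isoforms() {
  awk -F'\t' -v OFS='\t' '
    { n[$13]++ }
    END { for (g in n) print g, n[g] }
  ' "$1" | LC_ALL=C sort
}

count_isoforms sample_genes.txt
```

The explicit sort stands in for datamash's -s option; the ordering of gene names may differ from datamash's depending on locale.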

Using the collapse operation, datamash can print all the isoforms for each gene:

$ datamash -s -g 13 count 2 collapse 2 < genes.txt
ABCC1   1  NM_004996
ABCC10  2  NM_001198934,NM_033450
ABCC11  3  NM_032583,NM_033151,NM_145186
ABCC12  1  NM_033226
ABCC13  2  NR_003087,NR_003088
...
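
For comparison, here is a rough awk sketch of count plus collapse. It is not how datamash works internally; the sample file sample_isoforms.txt and the names make_row and collapse_isoforms are hypothetical, and the input is again assumed to be tab-separated with genes in column 13 and transcripts in column 2. Unlike the output above, this sketch lists the isoforms in input order.

```shell
# Hypothetical 13-column sample: transcript ID in column 2,
# gene name in column 13, "." as filler everywhere else.
make_row() {
  printf '.\t%s\t.\t.\t.\t.\t.\t.\t.\t.\t.\t.\t%s\n' "$1" "$2"
}
{
  make_row NM_033450 ABCC10
  make_row NM_004996 ABCC1
  make_row NM_001198934 ABCC10
} > sample_isoforms.txt

# For each gene, keep a running count and a comma-separated list of
# transcript IDs (in input order), then sort the lines by gene name.
collapse_isoforms() {
  awk -F'\t' -v OFS='\t' '
    {
      if ($13 in n) ids[$13] = ids[$13] "," $2
      else          ids[$13] = $2
      n[$13]++
    }
    END { for (g in n) print g, n[g], ids[g] }
  ' "$1" | LC_ALL=C sort
}

collapse_isoforms sample_isoforms.txt
```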

When using a file with a header line, add -H:

$ datamash -H -s -g 13 count 2 collapse 2 < genes_h.txt
GroupBy(name2)  count(name) collapse(name)
ABCC1           1           NM_004996
ABCC10          2           NM_001198934,NM_033450
ABCC11          3           NM_033151,NM_145186,NM_032583
ABCC12          1           NM_033226
ABCC13          2           NR_003088,NR_003087
...

Looks quite useful, thanks for sharing the tweet!
