9.8 years ago
Pierre Lindenbaum
via
GNU datamash - me likey! http://t.co/euWLKGLeEc Seems very much designed with biologists in mind
— R⓪ss Mounce (@rmounce) August 4, 2014
Find Number of isoforms per gene
The gene identifiers are in column 13 and the transcript identifiers are in column 2. To count how many isoforms each gene has, use datamash to group by column 13 and, for each group, count the values in column 2 (use -s to automatically sort the input file):
$ datamash -s -g 13 count 2 < genes.txt
ABCC1 1
ABCC10 2
ABCC11 3
ABCC12 1
ABCC13 2
...
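For anyone who wants to see what the group-and-count step is doing, or who does not have datamash installed, here is a rough awk equivalent. It is only a sketch: the sample file is a hypothetical two-column cut-down (transcript ID in column 1, gene name in column 2), not the real genes.txt layout, and the final sort uses the C locale, which may not match datamash's own ordering.

```shell
# Hypothetical two-column sample (transcript ID, gene name), tab-separated.
# The real genes.txt keeps the gene name in column 13 and the transcript in column 2.
sample=$(mktemp)
printf 'NM_033450\tABCC10\nNM_004996\tABCC1\nNM_001198934\tABCC10\n' > "$sample"

# Count how many rows (transcripts) share each gene name, then sort by gene.
# On this sample layout the datamash form would be: datamash -s -g 2 count 1
counts=$(awk -F'\t' -v OFS='\t' '
  { n[$2]++ }
  END { for (g in n) print g, n[g] }' "$sample" | LC_ALL=C sort)

printf '%s\n' "$counts"
rm -f "$sample"
```

The awk array n plays the role of datamash's groups: one counter per distinct gene name.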
Using the collapse operation, datamash can print all the isoforms for each gene:
$ datamash -s -g 13 count 2 collapse 2 < genes.txt
ABCC1 1 NM_004996
ABCC10 2 NM_001198934,NM_033450
ABCC11 3 NM_032583,NM_033151,NM_145186
ABCC12 1 NM_033226
ABCC13 2 NR_003087,NR_003088
...
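The collapse step can be sketched the same way in plain awk. Again this assumes a hypothetical two-column sample (transcript ID, gene name) rather than the real file, and the input is pre-sorted by gene and then transcript so the comma-separated lists come out in a stable order.

```shell
# Hypothetical two-column sample (transcript ID, gene name), tab-separated.
sample=$(mktemp)
printf 'NM_033450\tABCC10\nNM_004996\tABCC1\nNM_001198934\tABCC10\n' > "$sample"

tab=$(printf '\t')

# Sort by gene (column 2), then transcript (column 1), and build one
# comma-separated list of transcripts per gene alongside the count.
collapsed=$(LC_ALL=C sort -t "$tab" -k2,2 -k1,1 "$sample" | awk -F'\t' -v OFS='\t' '
  {
    n[$2]++
    ids[$2] = (ids[$2] == "" ? $1 : ids[$2] "," $1)
  }
  END { for (g in n) print g, n[g], ids[g] }' | LC_ALL=C sort)

printf '%s\n' "$collapsed"
rm -f "$sample"
```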
When using a file with a header line, add -H:
$ datamash -H -s -g 13 count 2 collapse 2 < genes_h.txt
GroupBy(name2) count(name) collapse(name)
ABCC1 1 NM_004996
ABCC10 2 NM_001198934,NM_033450
ABCC11 3 NM_033151,NM_145186,NM_032583
ABCC12 1 NM_033226
ABCC13 2 NR_003088,NR_003087
...
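If you want to mimic the header handling without datamash, one possible approach is to split off the first line and process the rest separately. This is a sketch under assumptions: the two-column sample is hypothetical, the column names name and name2 are taken from the output above, and the header text is just imitating datamash's format rather than being produced by it.

```shell
# Hypothetical two-column sample with a header line (name = transcript, name2 = gene).
sample=$(mktemp)
printf 'name\tname2\nNM_033450\tABCC10\nNM_004996\tABCC1\nNM_001198934\tABCC10\n' > "$sample"

# Build an output header from the first line only.
header=$(head -n 1 "$sample" | awk -F'\t' -v OFS='\t' '
  { print "GroupBy(" $2 ")", "count(" $1 ")" }')

# Count transcripts per gene over the remaining lines.
body=$(tail -n +2 "$sample" | awk -F'\t' -v OFS='\t' '
  { n[$2]++ }
  END { for (g in n) print g, n[g] }' | LC_ALL=C sort)

with_header=$(printf '%s\n%s' "$header" "$body")
printf '%s\n' "$with_header"
rm -f "$sample"
```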
Looks quite useful, thanks for sharing the tweet!