How to sort the rows of a file as a matrix?
2.8 years ago
BATMAN • 0

I would like to know how I can sort the rows of a file in the following way:

My file is file.txt (tab delimited):

g1: 00A98_01563 00554_01552 CCUG38_01373    
d2: 00554_01444
g3_3: 00A98_04566 CCUG38_05322

I want to get this (tab delimited):

 00A98 00554 CCUG38
g1 1 1 1
d2 0 1 0
g3_3 1 0 1

How can I do this on the command line?

All the best, Regards

linux • 1.1k views

I think this will be quite cumbersome on the command line alone. This is really a task for Python or R.

2.8 years ago
awk  '{for(i=2;i<=NF;i++) print $1"\t"$i;}' input.tsv |  datamash crosstab 1,2  --filler=0
    strainA_Nbr strainB_Nbr strainC_Nbr
g1  1   1   1
g2  0   1   0
g3  1   0   1

Thanks a lot for the help! I am dealing with the query_pan_genome output of Roary. I updated the example in the question after you answered, and I apologize for that. Seeing your answer, should I first convert the table as shown below and then run your command?

Initial:

g1: 00A98_01563 00554_01552 CCUG38_01373    
d2: 00554_01444
g3_3: 00A98_04566 CCUG38_05322

Delete what follows after the "_"

g1: 00A98 00554 CCUG38  
d2: 00554
g3_3: 00A98 CCUG38

Then run your command to get:

00A98 00554 CCUG38
g1 1 1 1
d2 0 1 0
g3_3 1 0 1

How can I do this?


You just need to update the awk code:

awk '{gsub(":","",$1);for(i=2;i<=NF;i++) {split($i,a,"_"); print $1"\t"a[1];}}' input.tsv | datamash crosstab 1,2 --filler=0
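For anyone wondering what the awk step hands to datamash: it turns each row into one (gene, strain) pair per line, and crosstab then counts those pairs. Here is a rough Python sketch of just that first step, not the actual awk code. The function name `to_pairs` and the inline `rows` list are made up for illustration, and it assumes whitespace-separated fields as in the example.

```python
def to_pairs(lines):
    """Yield (gene, strain) pairs, one per member listed on each row."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene = fields[0].replace(':', '')  # same idea as gsub(":","",$1)
        for member in fields[1:]:
            # keep only the part before the first underscore, like split($i,a,"_")
            yield gene, member.split('_')[0]


# Example rows copied from the question (hypothetical inline input).
rows = [
    "g1: 00A98_01563 00554_01552 CCUG38_01373",
    "d2: 00554_01444",
    "g3_3: 00A98_04566 CCUG38_05322",
]

for gene, strain in to_pairs(rows):
    print(f"{gene}\t{strain}")
```

Each printed line is one "long format" record such as `g1<TAB>00A98`, which is the shape that `datamash crosstab 1,2` expects on its input.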

2.8 years ago

Using Python is a bit convoluted, but this seems to give the desired output:

#!/usr/bin/env python

import sys
import io

rows_str = '''g1: 00A98_01563 00554_01552 CCUG38_01373    
d2: 00554_01444
g3_3: 00A98_04566 CCUG38_05322'''

# read in records
rows_fh = io.StringIO(rows_str)
records = {}
keys = set()
for row in rows_fh:
    elems = row.split()
    value = elems[0].replace(':', '')
    if value not in records: records[value] = set()
    ks = [x.split('_')[0] for x in elems[1:]]
    for k in ks:
        records[value].add(k)
        keys.add(k)

# output matrix
o = [''] + [k for k in keys]
ol = '\t'.join(o) + '\n'
sys.stdout.write(ol)
for rk, rv in records.items():
    o = [rk]
    for k in keys:
        v = '1' if k in rv else '0'
        o.append(v)
    ol = '\t'.join(o)  + '\n'
    sys.stdout.write(ol)

Output:

% ./so9478485.py
    CCUG38  00A98   00554
g1  1   1   1
d2  0   0   1
g3_3    1   1   0
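Note that the column order in the output above comes from iterating over a Python set, so it can differ from the order in the question. If the columns should appear in the order the strains are first seen, a variant along these lines might work. This is only a sketch: `presence_matrix` and the inline `rows` list are names invented here, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def presence_matrix(lines):
    """Build a presence/absence table as a list of rows (lists of strings)."""
    genes = {}    # gene -> set of strains; dict keeps row order
    strains = {}  # dict used as an ordered set, to keep column order
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene = fields[0].replace(':', '')
        members = genes.setdefault(gene, set())
        for member in fields[1:]:
            strain = member.split('_')[0]  # drop everything after the first "_"
            members.add(strain)
            strains.setdefault(strain, None)
    table = [[''] + list(strains)]  # header row with an empty corner cell
    for gene, members in genes.items():
        table.append([gene] + ['1' if s in members else '0' for s in strains])
    return table


# Example rows copied from the question (hypothetical inline input).
rows = [
    "g1: 00A98_01563 00554_01552 CCUG38_01373",
    "d2: 00554_01444",
    "g3_3: 00A98_04566 CCUG38_05322",
]

for row in presence_matrix(rows):
    print('\t'.join(row))
```

With the three example rows this prints the header `00A98 00554 CCUG38` (after an empty first cell) followed by `g1 1 1 1`, `d2 0 1 0` and `g3_3 1 0 1`, tab separated. Since the function only iterates over lines, an open file handle can be passed in place of the list.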