Question

Transform duplicate rows into columns.

0

Entering edit mode

6.4 years ago

unique379 ▴ 90

Hi all, I would like to make duplicate row lines as columns. So it would be easier to extract repetitive columns as comma separated format.

These are transcript isoforms that means one gene has multiple transcript, which I want in one columns by comma separated.

Input:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;

Desired output:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552;   Clk1;   protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213; 
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446;  Cbx3;   protein_coding; protein_coding;protein_coding;  protein_coding;

7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451;ENSMUST00000129493;  ENSMUST00000129493; protein_coding; Pirb-201;Pirb-201;  protein_coding;retained_intron;

I tried but this transform but i dont know how to find duplicates for each genes its different for each one basically i need repetitive isoform in one line:

awk 'ORS=NR%7?" ":"\n"'  test.txt

Also:

cut -f 1 test.txt | sort | uniq > gene
for f in `cat gene`; do  out=`cat test.txt | grep "$f" | tr '\n' '\t'`; echo $out; done > out.txt

but output is not something i desired. Please help

annotations RNA-Seq bash R linux • 1.5k views

ADD COMMENT • link updated 6.4 years ago by cpad0112 21k • written 6.4 years ago by unique379 ▴ 90

0

Entering edit mode

I just don't understand why you want to convert file into that format. working on duplicated rows is way much easier than this conversion

ADD REPLY • link 6.4 years ago by shoujun.gu ▴ 380

0

Entering edit mode

6.4 years ago

cpad0112 21k

output (there was an extract space in last two rows. To remove it sed was used. If data is uniform, you can use datamash direct):

$ sed 's/\s\+/ /g' test.txt | datamash -t " " -sg 1 unique 2-6 | sed 's/;,/;/g'

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552; Clk1; protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446; Cbx3; protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138,ENSMUSG00000058818; ENSMUST00000078451;ENSMUST00000129493; ENSMUST00000129493;Pirb; protein_coding; Pirb-201;

input (copy/pasted from OP):

$ cat test.txt 
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;

ADD COMMENT • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

Thank you for introducing datamash, i never knew before. I have to install and check if its work for me. But thank you.

ADD REPLY • link 6.4 years ago by unique379 ▴ 90

0

Entering edit mode

if you are using any one of the main distros (RHEL, centos, ubuntu, mint), datamash is in repos.

ADD REPLY • link 6.4 years ago by cpad0112 21k

score 3 · Accepted Answer · 2017-12-06

3

Entering edit mode

6.4 years ago

second_exon ▴ 210

A quick python3 solution:

make sure your input file is tab-separated

out = {}
with open("input.txt") as f:
    for lines in f:
        lines = lines.strip().split("\t")
        if lines[0] not in out:
            out[lines[0]] = [set(), set(), set(), set(), set(), set()]
        for i, vals in enumerate(lines[1:]):
            out[lines[0]][i].add(vals)

with open("outfile.txt", "w") as o:
    for kys, vls in out.items():
        o.write(kys+"\t")
        for vl in vls:
            o.write("|".join(vl)+"\t")
        o.write("\n")