Question: Transform duplicate rows into columns.
0
gravatar for unique379
6 days ago by
unique37950
Spain
unique37950 wrote:

Hi all, I would like to make duplicate row lines as columns. So it would be easier to extract repetitive columns as comma separated format.

These are transcript isoforms that means one gene has multiple transcript, which I want in one columns by comma separated.

Input:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;

Desired output:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552;   Clk1;   protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213; 
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446;  Cbx3;   protein_coding; protein_coding;protein_coding;  protein_coding;

7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451;ENSMUST00000129493;  ENSMUST00000129493; protein_coding; Pirb-201;Pirb-201;  protein_coding;retained_intron;

I tried but this transform but i dont know how to find duplicates for each genes its different for each one basically i need repetitive isoform in one line:

awk 'ORS=NR%7?" ":"\n"'  test.txt

Also:

cut -f 1 test.txt | sort | uniq > gene
for f in `cat gene`; do  out=`cat test.txt | grep "$f" | tr '\n' '\t'`; echo $out; done > out.txt

but output is not something i desired. Please help

linux rna-seq annotations bash R • 149 views
ADD COMMENTlink modified 5 days ago by cpad01123.6k • written 6 days ago by unique37950

I just don't understand why you want to convert file into that format. working on duplicated rows is way much easier than this conversion

ADD REPLYlink modified 6 days ago • written 6 days ago by shoujun.gu290
3
gravatar for ramesh.8v
5 days ago by
ramesh.8v180
United States
ramesh.8v180 wrote:

A quick python3 solution:

make sure your input file is tab-separated

out = {}
with open("input.txt") as f:
    for lines in f:
        lines = lines.strip().split("\t")
        if lines[0] not in out:
            out[lines[0]] = [set(), set(), set(), set(), set(), set()]
        for i, vals in enumerate(lines[1:]):
            out[lines[0]][i].add(vals)

with open("outfile.txt", "w") as o:
    for kys, vls in out.items():
        o.write(kys+"\t")
        for vl in vls:
            o.write("|".join(vl)+"\t")
        o.write("\n")
ADD COMMENTlink modified 5 days ago • written 5 days ago by ramesh.8v180

Excellent Thank you.

ADD REPLYlink written 5 days ago by unique37950
0
gravatar for cpad0112
5 days ago by
cpad01123.6k
cpad01123.6k wrote:

output (there was an extract space in last two rows. To remove it sed was used. If data is uniform, you can use datamash direct):

$ sed 's/\s\+/ /g' test.txt | datamash -t " " -sg 1 unique 2-6 | sed 's/;,/;/g'

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552; Clk1; protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446; Cbx3; protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138,ENSMUSG00000058818; ENSMUST00000078451;ENSMUST00000129493; ENSMUST00000129493;Pirb; protein_coding; Pirb-201;

input (copy/pasted from OP):

$ cat test.txt 
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;
ADD COMMENTlink written 5 days ago by cpad01123.6k

Thank you for introducing datamash, i never knew before. I have to install and check if its work for me. But thank you.

ADD REPLYlink written 5 days ago by unique37950

if you are using any one of the main distros (RHEL, centos, ubuntu, mint), datamash is in repos.

ADD REPLYlink written 5 days ago by cpad01123.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1431 users visited in the last hour