Question: Transform duplicate rows into columns.
0
gravatar for unique379
12 months ago by
unique37950
Spain
unique37950 wrote:

Hi all, I would like to make duplicate row lines as columns. So it would be easier to extract repetitive columns as comma separated format.

These are transcript isoforms that means one gene has multiple transcript, which I want in one columns by comma separated.

Input:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;

Desired output:

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552;   Clk1;   protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213; 
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446;  Cbx3;   protein_coding; protein_coding;protein_coding;  protein_coding;

7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451;ENSMUST00000129493;  ENSMUST00000129493; protein_coding; Pirb-201;Pirb-201;  protein_coding;retained_intron;

I tried but this transform but i dont know how to find duplicates for each genes its different for each one basically i need repetitive isoform in one line:

awk 'ORS=NR%7?" ":"\n"'  test.txt

Also:

cut -f 1 test.txt | sort | uniq > gene
for f in `cat gene`; do  out=`cat test.txt | grep "$f" | tr '\n' '\t'`; echo $out; done > out.txt

but output is not something i desired. Please help

linux rna-seq annotations bash R • 398 views
ADD COMMENTlink modified 12 months ago by cpad011210k • written 12 months ago by unique37950

I just don't understand why you want to convert file into that format. working on duplicated rows is way much easier than this conversion

ADD REPLYlink modified 12 months ago • written 12 months ago by shoujun.gu350
3
gravatar for ramesh.8v
12 months ago by
ramesh.8v200
United States
ramesh.8v200 wrote:

A quick python3 solution:

make sure your input file is tab-separated

out = {}
with open("input.txt") as f:
    for lines in f:
        lines = lines.strip().split("\t")
        if lines[0] not in out:
            out[lines[0]] = [set(), set(), set(), set(), set(), set()]
        for i, vals in enumerate(lines[1:]):
            out[lines[0]][i].add(vals)

with open("outfile.txt", "w") as o:
    for kys, vls in out.items():
        o.write(kys+"\t")
        for vl in vls:
            o.write("|".join(vl)+"\t")
        o.write("\n")
ADD COMMENTlink modified 12 months ago • written 12 months ago by ramesh.8v200

Excellent Thank you.

ADD REPLYlink written 12 months ago by unique37950
0
gravatar for cpad0112
12 months ago by
cpad011210k
India
cpad011210k wrote:

output (there was an extract space in last two rows. To remove it sed was used. If data is uniform, you can use datamash direct):

$ sed 's/\s\+/ /g' test.txt | datamash -t " " -sg 1 unique 2-6 | sed 's/;,/;/g'

1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868;ENSMUST00000129577;ENSMUST00000135380;ENSMUST00000148330;ENSMUST00000151338;ENSMUST00000156931;ENSMUST00000186552; Clk1; protein_coding; Clk1-201;Clk1-204;Clk1-206;Clk1-210;Clk1-211;Clk1-212;Clk1-213;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836;ENSMUST00000114446; Cbx3; protein_coding; protein_coding;
7_3717137_3717138 7_3717137_3717138,ENSMUSG00000058818; ENSMUST00000078451;ENSMUST00000129493; ENSMUST00000129493;Pirb; protein_coding; Pirb-201;

input (copy/pasted from OP):

$ cat test.txt 
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000034868; Clk1;   protein_coding; Clk1-201;   protein_coding;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000129577; Clk1;   protein_coding; Clk1-204;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000135380; Clk1;   protein_coding; Clk1-206;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000148330; Clk1;   protein_coding; Clk1-210;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000151338; Clk1;   protein_coding; Clk1-211;   nonsense_mediated_decay;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000156931; Clk1;   protein_coding; Clk1-212;   retained_intron;
1_58420023_58420024 ENSMUSG00000026034; ENSMUST00000186552; Clk1;   protein_coding; Clk1-213;   retained_intron;
6_51483010_51483011 ENSMUSG00000029836; ENSMUSG00000029836; Cbx3;   protein_coding; protein_coding; protein_coding;
6_51483010_51483011 ENSMUSG00000029836; ENSMUST00000114446; Cbx3;   protein_coding; protein_coding; protein_coding;
7_3717137_3717138   7_3717137_3717138   ENSMUST00000078451; ENSMUST00000129493; protein_coding; Pirb-201;   protein_coding;
7_3717137_3717138   ENSMUSG00000058818; ENSMUST00000129493; Pirb;   protein_coding; Pirb-201;   retained_intron;
ADD COMMENTlink written 12 months ago by cpad011210k

Thank you for introducing datamash, i never knew before. I have to install and check if its work for me. But thank you.

ADD REPLYlink written 12 months ago by unique37950

if you are using any one of the main distros (RHEL, centos, ubuntu, mint), datamash is in repos.

ADD REPLYlink written 12 months ago by cpad011210k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2189 users visited in the last hour