Question: help in converting a table
0
gravatar for Chris
12 months ago by
Chris30
Chris30 wrote:

Hi all, I have a table as the example below which I need to convert to a table where the first column is the name of genes followed by the expression values for that particular gene.

hsa-mir-1229    1.197748
hsa-mir-3163    0.000000
hsa-mir-3175    0.000000
hsa-mir-3189    0.149719
hsa-mir-3619    3.144089
hsa-mir-3646    0.149719
hsa-mir-3658    0.000000
hsa-mir-3662    1.497185
hsa-mir-3928    1.347467
hsa-mir-4262    0.000000
hsa-mir-1229    1.000000
hsa-mir-3163    0.800000
hsa-mir-3175    0.030000
hsa-mir-3189    0.149719
hsa-mir-3619    3.144089
hsa-mir-3646    0.449719
hsa-mir-3658    0.490719
hsa-mir-3662    0.497185
hsa-mir-4262    0.700000
----------
hsa-mir-1229    1.197748   1.000000
hsa-mir-3163    0.000000   0.800000
hsa-mir-3175    0.000000   0.030000
hsa-mir-3189    0.149719   0.149719
hsa-mir-3619    3.144089   3.144089
hsa-mir-3646    0.149719   0.449719
hsa-mir-3658    0.000000   0.490719
hsa-mir-3662    1.497185   0.497185
hsa-mir-3928    1.347467     NA
hsa-mir-4262    0.000000   0.700000

Any help is highly appreciated. thank you

rna-seq convert format • 515 views
ADD COMMENTlink modified 12 months ago by cpad011211k • written 12 months ago by Chris30

Thank you for your answer, I think I did not explain myself properly.

This is what I get:

hsa-mir-1229 1.197748hsa-mir-3163 0.000000hsa-mir-3175 0.000000hsa-mir-3189 0.149719hsa-mir-3619 3.144089hsa-mir-3646 0.149719hsa-mir-3658 0.000000hsa-mir-3662 1.497185hsa-mir-3928 1.347467

as an output rather to what I describe earlier.

Again thank you so much for your help

ADD REPLYlink modified 12 months ago by genomax65k • written 12 months ago by Chris30

I 've tried excel and still I am getting the same format. All genes and expressions are in one line.

ADD REPLYlink written 12 months ago by Chris30

Please do not add answers unless you're answering your own top level question. These are responses to other people's comments, which should be added as comment replies. If appropriate, your post should be edited. See: How to add a comment reply and How to edit your post

ADD REPLYlink written 12 months ago by RamRS21k
1
gravatar for st.ph.n
12 months ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

Here's a quick python solution. I'm assuming your table is tab-delimited, and it will output tab-delimited.

#!/usr/bin/env python
import sys
from collections import defaultdict

with open(sys.argv[1], 'r') as f:
    genes = defaultdict(list)
    for line in f:
        genes[line.strip().split('\t')[0]].append(line.strip().split('\t')[1])

with open(sys.argv[2], 'w') as out:
    for i in genes:
        out.write(i + '\t' + '\t'.join(genes[i]))

Save code as format_table.py, run as python format_table.py input.txt output.txt

ADD COMMENTlink written 12 months ago by st.ph.n2.4k
2

Missing a line break in the out.write line. Change the last line to:

out.write(i + '\t' + '\t'.join(genes[i]) + '\n')
ADD REPLYlink written 12 months ago by Damian Kao15k

Yep, using write() in Python requires adding line terminators. Easy to miss.

ADD REPLYlink written 12 months ago by Alex Reynolds28k

I'm more use to 2.X syntax so I wasn't aware. Thanks.

ADD REPLYlink written 12 months ago by st.ph.n2.4k

thank you so much for your quick answer. I don't have any experience in python and I am willing to learn it. When I run your code somehow all the data in the output are place in one line like that:

hsa-mir-4649    0.449156    0.000000    0.808519    0.000000    0.543301    0.000000    0.000000    0.000000    0.000000    0.252087    0.000000    0.000000    0.000000    0.000000    0.245496    0.000000    0.000000    0.090737    0.000000    0.082297    0.000000    0.000000    0.000000    0.671861    0.207298    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.389510    0.428801    0.000000    0.000000    0.000000    0.000000    0.264927    0.000000    0.000000    0.000000    0.376946    0.316828    0.086942    0.396281    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.259527    0.107388    0.271481    0.000000    0.000000    0.295719    0.000000    0.177120    0.000000    0.175829    0.000000    0.477968    0.201986    0.482697    0.000000    0.810934    0.000000    0.000000    0.152771    0.349035    0.216139    0.000000    0.000000    0.310665    0.000000    0.000000    0.000000    0.997040    0.000000    0.000000    0.000000    0.000000    0.000000    0.140049    0.000000    0.000000    0.000000    0.000000    0.085340    0.000000    0.000000    0.000000    0.000000    0.202206    0.000000    0.000000    0.426247    0.111463    0.136478    0.103880    0.000000    0.518896    0.000000    0.000000    0.000000    0.219701    0.187424    0.000000    0.282888    0.190998    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.773226    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.392962    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.231889    0.000000    0.233245    0.000000    0.432839    0.363025    0.291168    0.000000    0.000000    0.000000    0.000000    0.155571    0.231324    0.000000    0.112256    0.212441    0.000000    0.000000    0.000000    0.000000    0.526724    0.000000    0.000000    0.494783    0.000000    0.335871    0.646818    0.000000    0.881312    0.000000    0.000000    0.157905    0.000000    0.186378    0.189747    0.513107    0.304384    0.450149    0.000000    0.000000    0.147183    0.000000    0.000000    0.402776    0.000000    0.123698    0.293994    0.000000    0.000000    0.000000    1.520795    0.000000    0.150698    0.349743    0.000000    0.000000    0.412048    0.000000    0.000000    0.000000    0.126304    0.000000    0.146505    0.000000    0.000000    0.000000    0.590510    0.000000    0.000000    0.692906    0.000000    0.000000    0.080122    0.000000    0.339625    0.135295    0.000000    0.000000    0.000000    0.000000    0.226271    0.238614    0.675249    0.000000    0.000000    0.109565    1.215094    0.122594    0.000000    0.754284    0.177904    0.000000    0.389059    0.170698    0.085960    0.000000    0.000000    0.000000    0.136473    0.299894    0.166363    0.135930    0.369595    0.617371    0.273669    0.000000    0.000000    0.000000    0.278574    0.000000    0.000000    0.000000    0.467559    0.052991    0.000000    0.073191    0.000000    0.000000    0.755511    0.000000    0.393010    0.000000    0.000000    0.123346    0.182877    0.305522    0.310795    1.063833    0.396389    0.447259    0.000000    0.114615    0.199650    0.000000    0.141586    0.362416    0.000000    0.000000    0.919735    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.360965    0.000000    0.165587    0.000000    0.360406    1.108463    0.175386    0.000000    0.565773    0.370455    0.000000    0.000000    0.173638    0.000000    0.000000    0.000000    0.244640    0.000000    0.000000    0.000000    0.178119    0.139975    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.161680    0.000000    0.000000    0.387075    0.367204    0.000000    0.528403    0.000000    0.000000    0.000000    0.231294    0.000000    0.112177    0.421011    0.000000    0.083285    0.522824    0.000000    0.000000    0.187261    0.581464    0.756509    0.000000    0.000000    1.517789    0.000000    1.024637    0.000000    0.201494    0.198940    0.000000    0.000000    0.061106    0.237552    0.233458    0.000000    0.000000    0.000000    0.000000    0.478040    0.000000    0.000000    0.166597    0.552751    0.400058    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.533696    0.000000    0.370002    0.000000    0.000000    0.000000    0.000000    0.255239    0.000000    0.000000    0.000000    0.275042    0.000000    0.000000    0.235675    0.000000    0.000000    0.578777    0.000000    0.429569    0.357804    0.414087    0.375918    0.000000    0.000000    0.170525    0.000000    0.000000    0.397967    0.000000    0.234764    0.199694    0.214717    0.000000    0.226096    0.000000    0.000000    0.000000    0.000000    0.000000    0.342193    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.136683    0.000000    0.000000    0.382529    0.000000    0.000000    0.159683    0.214213    0.000000    0.160232    0.000000hsa-mir-4500    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
ADD REPLYlink modified 12 months ago by genomax65k • written 12 months ago by Chris30

How many times does hsa-mir-4649 appear in your input file? This should equal (+1) the number of columns in the output for that gene id.

grep -e 'hsa-mir-4649' | wc -l

How many genes in your input file? How many lines in your output file? These should be identical.

cut -f 1 input.txt | sort | uniq | wc -l 

wc -l output.txt
ADD REPLYlink modified 12 months ago • written 12 months ago by st.ph.n2.4k

Sometimes new line characters aren't interpreted properly, depending on how you're opening your file, e.g. notepad. Trying using the command-line, less output.txt, or open in Excel if it is not too large.

ADD REPLYlink written 12 months ago by st.ph.n2.4k
1
gravatar for avek
12 months ago by
avek10
avek10 wrote:

try

#!/bin/bash

awk '{arr[$1]=arr[$1]"\t"$2} END {for(g in arr) printf "%s\t%s\n", g, arr[g]}' ~/Desktop/myfile.txt

where myfile.txt is your file actually. You will get something like:

hsa-mir-3646        0.149719    0.449719
hsa-mir-3619        3.144089    3.144089
hsa-mir-3163        0.000000    0.800000
hsa-mir-3928        1.347467
hsa-mir-3658        0.000000    0.490719
hsa-mir-3175        0.000000    0.030000
hsa-mir-4262        0.000000    0.700000
hsa-mir-3189        0.149719    0.149719
hsa-mir-3662        1.497185    0.497185
hsa-mir-1229        1.197748    1.000000
ADD COMMENTlink written 12 months ago by avek10
2
gravatar for cpad0112
12 months ago by
cpad011211k
India
cpad011211k wrote:
$ datamash -s  -g 1 collapse 2 < test.txt | awk -F "," -v OFS="," '{ if(!$2) $2 = "NA" }; 1'| sed 's/,/\t/g'
hsa-mir-1229    1.197748    1.000000
hsa-mir-3163    0.000000    0.800000
hsa-mir-3175    0.000000    0.030000
hsa-mir-3189    0.149719    0.149719
hsa-mir-3619    3.144089    3.144089
hsa-mir-3646    0.149719    0.449719
hsa-mir-3658    0.000000    0.490719
hsa-mir-3662    1.497185    0.497185
hsa-mir-3928    1.347467    NA
hsa-mir-4262    0.000000    0.700000
ADD COMMENTlink modified 12 months ago • written 12 months ago by cpad011211k
0
gravatar for Alex Reynolds
12 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

The other answers to this question do not add NA fields where there are missing values. If you want NA values in your table to "fill in the gaps", the script will have to do a little more work.

Here's an example of how that might be done. There are other ways to do this, of course. This assumes that your data are tab-delimited.

#!/usr/bin/env python

import sys

d = {}
m = 0

for line in sys.stdin:
    (k,v) = line.strip().split('\t')
    if k not in d:
        d[k] = []
    d[k].append(v)
    l = len(d[k])
    if l > m:
        m = l

for k in d.keys():
    for i in range(len(d[k]), m):
        d[k].append('NA')
    sys.stdout.write("%s\t%s\n" % (k, '\t'.join(d[k])))

Here's an example of how you could use this script. Let's say this script is called collapse.py:

$ ./collapse.py < expression.txt
hsa-mir-1229    1.197748        1.000000
hsa-mir-3163    0.000000        0.800000
hsa-mir-3175    0.000000        0.030000
hsa-mir-3189    0.149719        0.149719
hsa-mir-3619    3.144089        3.144089
hsa-mir-3646    0.149719        0.449719
hsa-mir-3658    0.000000        0.490719
hsa-mir-3662    1.497185        0.497185
hsa-mir-3928    1.347467        NA
hsa-mir-4262    0.000000        0.700000

This script assumes that you don't need the hsa-mir-* rows printed in any particular order. If you need ordering, you can either sort the output with Unix sort or use an ordered dictionary in Python. Feel free to clarify your question, if so.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Alex Reynolds28k
0
gravatar for Chris
12 months ago by
Chris30
Chris30 wrote:

Hi all, thank you so much for your valuable answers.

ADD COMMENTlink written 12 months ago by Chris30

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink modified 12 months ago • written 12 months ago by RamRS21k

Also, this is not an answer. After you provide feedback on the other answers, please delete this post:

  • Click on moderate (link next to the Add Comment button)
  • Choose Delete Post
  • Click on the blue Submit button.
ADD REPLYlink written 12 months ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1361 users visited in the last hour