Question

Creating Protovis Sunburst Charts: Python Script to Create a Dataset in JSON Format (or Parent-Child JSON)

11.0 years ago

ram.dsramesh ▴ 40

Help! As a biologist, I am mainly interested in visualizing and displaying my data, and I am very new to programming. I have a data set in an Excel file, which looks like this:

data set

I think a Sunburst chart in Protovis would be a very good way to display my data set.

But I am stuck on preparing the data, which has to be in JSON format. Note that the structure of the data is hierarchical (a parent-child hierarchy). Since I am not very good at programming (I know only a little Python), it is difficult for me to go ahead.

I need a Python script that can read my Excel file and generate JSON as described above.

In my data set there is a parent-child relationship: L1 is the parent of L2, L2 is the parent of L3, and so on.

>L1 (PARENT) - L2 (CHILD)
>L2 (PARENT) - L3 (CHILD)
>L3 (PARENT) - L4 (CHILD)
>L4 (PARENT) - L5 (CHILD)
>L5 (PARENT) - GENE_NAME (CHILD)

sunburst chart

I hope I can get my data set visualized in the above format, but for that my data set has to be in the JSON format specified here.

I would like to display my data as something like this:

MY_IMAGE

Any sort of help is appreciated.

python • 21k views

updated 2.2 years ago by Ram 43k • written 11.0 years ago by ram.dsramesh ▴ 40

score 7 · Answer 1 · 2013-04-03

Open your data in Excel and save it as a CSV file.

L1,L2,L3,L4,L5,GENE_NAME
Enzyme,Kinase,Protein Kinase,Ser_Thr,Cmgc,MAPK11
Enzyme,Kinase,Protein Kinase,Tyr,Tk,ABL1
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRB
Enzyme,Kinase,Protein Kinase,Tyr,Tk,PDGFRA
Enzyme,Kinase,Protein Kinase,Ser_Thr,Tkl,ALK
Enzyme,Isomerase,Isomerase Other,,,gyrB
Enzyme,Oxidoreductase,Oxidoreductase Other,,,ALOX5
Enzyme,Oxidoreductase,Oxidoreductase Other,,,IMPDH1
Enzyme,Transferase,Transferase Other,,,COMT
Enzyme,Oxidoreductase,Oxidoreductase Other,,,RRM1
Enzyme,Oxidoreductase,Oxidoreductase Other,,,PTGS2
Enzyme,Lyase,Lyase Other,,,POLB
Enzyme,Lyase,Lyase Other,,,CA5B
Enzyme,Hydrolase,Hydrolase Other,,,GAA
Enzyme,Protease,Metallo,MAM,M10A,MMP8
Enzyme,Lyase,Lyase Other,,,CA5A
Enzyme,Lyase,Lyase Other,,,CA7

You can do the rest in Python:

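import csv
import json
import sys

tree = {}

# Python 3: newline='' lets the csv module handle line endings itself
reader = csv.reader(open(sys.argv[1], newline=''))
next(reader)  # skip the header row
for row in reader:
    subtree = tree
    for i, cell in enumerate(row):
        if cell:  # empty cells (unused levels) are skipped
            if cell not in subtree:
                # the last column (gene name) becomes a leaf with value 1
                subtree[cell] = {} if i < len(row) - 1 else 1
            subtree = subtree[cell]

print(json.dumps(tree, indent=4))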

Save the script as csv2json.py and run it:

python csv2json.py test.csv

It gives you (the order of the keys may differ):

{
    "Enzyme": {
        "Protease": {
            "Metallo": {
                "MAM": {
                    "M10A": {
                        "MMP8": 1
                    }
                }
            }
        }, 
        "Isomerase": {
            "Isomerase Other": {
                "gyrB": 1
            }
        }, 
        "Kinase": {
            "Protein Kinase": {
                "Tyr": {
                    "Tk": {
                        "ABL1": 1, 
                        "PDGFRB": 1, 
                        "PDGFRA": 1
                    }
                }, 
                "Ser_Thr": {
                    "Tkl": {
                        "ALK": 1
                    }, 
                    "Cmgc": {
                        "MAPK11": 1
                    }
                }
            }
        }, 
        "Transferase": {
            "Transferase Other": {
                "COMT": 1
            }
        }, 
        "Lyase": {
            "Lyase Other": {
                "CA5B": 1, 
                "CA5A": 1, 
                "CA7": 1, 
                "POLB": 1
            }
        }, 
        "Oxidoreductase": {
            "Oxidoreductase Other": {
                "PTGS2": 1, 
                "ALOX5": 1, 
                "IMPDH1": 1, 
                "RRM1": 1
            }
        }, 
        "Hydrolase": {
            "Hydrolase Other": {
                "GAA": 1
            }
        }
    }
}
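If the charting code expects explicit parent-child records rather than nested keys, the tree above can be converted afterwards. The following is only a sketch: the "name"/"children"/"size" field names follow the convention commonly seen in D3.js hierarchy examples and are an assumption here, as is the helper name to_children and the artificial "root" label.

```python
import json

def to_children(name, node):
    """Convert a nested dict (leaves are numbers) into name/children form."""
    if isinstance(node, dict):
        return {"name": name,
                "children": [to_children(k, v) for k, v in node.items()]}
    # Leaf: keep the number as the node's size (assumed field name).
    return {"name": name, "size": node}

# A small slice of the tree printed above.
tree = {
    "Enzyme": {
        "Kinase": {"Protein Kinase": {"Tyr": {"Tk": {"ABL1": 1, "PDGFRB": 1}}}},
        "Hydrolase": {"Hydrolase Other": {"GAA": 1}},
    }
}

# "root" is a hypothetical label for the artificial top-level node.
nested = to_children("root", tree)
print(json.dumps(nested, indent=2))
```

The nested-key form is more compact, while the name/children form makes each parent-child link explicit; which one is needed depends on the charting library.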

score 3 · Answer 2 · 2013-04-03

Here is a Python script I wrote a while back to produce JSON data for a D3.js sunburst diagram, which is similar to Protovis (same author). What's nice about Python is that printing a dictionary as a string gives something that is basically JSON (strictly, a JavaScript object literal: Python prints single quotes, whereas strict JSON requires double quotes). You need to get your data into tab-delimited format. You might have to modify the script a little to get it to work with Protovis.

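import sys

dataStructure = {}
for line in open(sys.argv[1], 'r'):
    # split() breaks on any whitespace, so fields must not contain spaces
    data = line.strip().split()

    # walk down the tree, creating levels as needed; the last column
    # (the gene name) is not stored, only counted
    current = dataStructure
    for item in data[:-2]:
        if item not in current:
            current[item] = {}

        current = current[item]
    # count the rows under the second-to-last column
    if data[-2] not in current:
        current[data[-2]] = 1
    else:
        current[data[-2]] += 1
print('var data = ' + str(dataStructure))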

Save as script.py and run by:

python script.py myData.tabdelimited > myData.json

For example, here are some sample data:

A    1    F    gene1
A    1    F    gene2
A    2    G    gene3
A    2    G    gene4
A    2    H    gene5
B    3    I    gene6
C    4    J    gene7
C    5    K    gene8
D    6    L    gene9
D    6    M    gene10
D    6    L    gene11

Here is the output of the script (the order of the keys may vary):

var data = {'A': {'1': {'F': 2}, '2': {'H': 1, 'G': 2}}, 'C': {'5': {'K': 1}, '4': {'J': 1}}, 'B': {'3': {'I': 1}}, 'D': {'6': {'M': 1, 'L': 2}}}
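The output above uses single quotes, which JavaScript accepts as an object literal but a strict JSON parser does not. If strict JSON is needed, the same counting can be combined with json.dumps. This is a minimal sketch, not the original author's script: the function name count_leaves and the inline sample rows are illustrative assumptions.

```python
import json

def count_leaves(lines):
    """Build nested counts from whitespace-separated rows.

    The last column (gene name) is not stored; rows are counted under
    the second-to-last column, as in the script above.
    """
    root = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or one-column lines
        current = root
        for item in fields[:-2]:
            current = current.setdefault(item, {})
        current[fields[-2]] = current.get(fields[-2], 0) + 1
    return root

# A few rows from the sample data above.
sample = ["A 1 F gene1", "A 1 F gene2", "A 2 G gene3", "B 3 I gene6"]
counts = count_leaves(sample)
print(json.dumps(counts))
# prints {"A": {"1": {"F": 2}, "2": {"G": 1}}, "B": {"3": {"I": 1}}}
```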

Ram · Answer 3 · 2014-11-16

Working on a Macintosh, I found it helpful to replace the:

 'reader = ...'

line with:

reader = csv.reader(open("filename.csv", 'rU'), quotechar='"', delimiter = ',')

This got me past the:

new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Note that this hard-codes the name of the CSV file rather than taking it as an argument from the command line.
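On Python 3 the 'U' mode flag was deprecated and then removed in Python 3.11; the csv module documentation instead recommends opening the file with newline=''. A minimal sketch of that approach follows. The helper name read_rows and the temporary demo file (written with old Mac-style carriage-return line endings) are illustrative assumptions.

```python
import csv
import os
import tempfile

def read_rows(path):
    """Read a CSV file, skip its header row, and return the remaining rows."""
    # newline='' passes line endings through untranslated so that the
    # csv module can handle \r, \n and \r\n itself.
    with open(path, newline='') as handle:
        reader = csv.reader(handle, quotechar='"', delimiter=',')
        next(reader)  # skip the header row
        return list(reader)

# Demo only: write a small file that uses bare carriage returns.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"L1,L2,GENE_NAME\rEnzyme,Kinase,ABL1\rEnzyme,Lyase,CA7\r")
    demo_path = tmp.name

rows = read_rows(demo_path)
os.remove(demo_path)
print(rows)
```

In a real script, the path would be "filename.csv" or sys.argv[1] rather than a temporary file.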