Question: How can I parse a json file from MobiDB to retrieve proteome data?
Jason2 wrote:

I've downloaded the human data from the MobiDB database in JSON format. It contains protein disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.

The file can be downloaded here (click on "json" next to human).

Here's an example of two proteins:

{  "acc" : "O43760",  "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{  "acc" : "Q92728",  "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }

This type of structure is repeated thousands of times to capture information across many proteins.

I would like to format the data as a simple table, with columns for the acc ID and the disorder predictor regions:

ID, region_start, region_end, disorder_type
O43760, 1, 19, "D"
O43760, 46, 54, "D"
O43760, 96, 100, "D"
O43760, 178, 224, "D"
Q92728, 1, 53, "D"

I know how to use terminal (i.e. downloaded software, awk) fairly well and I also use R, so if you could recommend solutions using those tools it would be greatly appreciated but if you know of another way that's fine. I have been playing around with jq and jt but haven't succeeded in using them to address this problem yet.

Any help would be appreciated! Thanks!

vkkodali wrote:

If you can use python, there is a module called json that can deal with this. Check out https://docs.python.org/3/library/json.html and specifically the 'Decoding JSON' part.
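As a rough illustration of what decoding looks like for this data, here is a minimal sketch that parses one record with json.loads and walks down to the regions. The record is a trimmed copy of the Q92728 entry from the question (sequence and scores left out); the variable names are mine, and keeping only the "simple" predictor is an assumption based on the table the question asks for.

```python
import json

# Trimmed copy of the Q92728 record from the question
# (the "sequence" and "scores" fields are omitted for brevity).
record_text = """
{"acc": "Q92728",
 "mobidb_consensus": {"disorder": {"predictors": [
   {"regions": [[1, 53, "D"]], "method": "simple"},
   {"regions": [[1, 53, "D_WC"]], "dc": 1, "method": "mobidb-lite"}
 ]}}}
"""

# json.loads turns the text into nested dicts and lists.
record = json.loads(record_text)
predictors = record["mobidb_consensus"]["disorder"]["predictors"]

# One row per region, keeping only the "simple" predictor
# (an assumption: the desired table in the question lists only those regions).
rows = [
    (record["acc"], start, end, kind)
    for predictor in predictors
    if predictor["method"] == "simple"
    for start, end, kind in predictor["regions"]
]

print(rows)  # [('Q92728', 1, 53, 'D')]
```

Dropping the method filter would also return the mobidb-lite regions (here the "D_WC" one).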

You can use the quick-and-dirty script shown below:

Usage:

./disorder_to_tbl.py disorder_UP000005640.mjson.gz > output_table.tsv

At least for the human file you have pointed to, I did not encounter any errors.
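For reference, a minimal sketch of how a script along these lines could be written. It is not necessarily the script referred to above. It assumes the download holds one JSON record per line (which the .mjson extension suggests, but I have not verified this), that it may be gzip-compressed, and that only the "simple" predictor's regions are wanted, as in the table in the question. The function names and the output header are my own choices.

```python
#!/usr/bin/env python3
import csv
import gzip
import json
import sys


def iter_region_rows(lines, method="simple"):
    """Yield (acc, start, end, disorder_type) for each region of one predictor.

    Assumes each non-empty line is a complete JSON record.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        disorder = record.get("mobidb_consensus", {}).get("disorder", {})
        for predictor in disorder.get("predictors", []):
            if predictor.get("method") != method:
                continue
            for start, end, kind in predictor.get("regions", []):
                yield record["acc"], start, end, kind


def main(path):
    # Read gzip-compressed or plain text, depending on the file name.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
        writer.writerow(["ID", "region_start", "region_end", "disorder_type"])
        writer.writerows(iter_region_rows(handle, method="simple"))


# Only run when called as a script with exactly one file argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because the records are read one line at a time, the whole proteome never has to fit in memory. Passing method="mobidb-lite" to iter_region_rows would report that predictor's regions instead.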


Thanks! However, I need to do this for the whole proteome, so hard-coding each protein would be difficult. Is there a way to loop over every protein in a high-throughput manner?

Reply written 7 days ago by Jason2

I updated my answer to change it to a script that you can use.

Reply written 7 days ago by vkkodali