Question: How can I parse a json file from MobiDB to retrieve proteome data?
Jason2 wrote:

I've downloaded the human data from the MobiDB database in its JSON format. It contains protein disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.

The file can be downloaded here (click on "json" next to human).

Here's an example of two proteins:

{  "acc" : "O43760",  "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{  "acc" : "Q92728",  "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }

This type of structure is repeated thousands of times to capture information across many proteins.

I would like to format the data as a simple table with columns for the accession ID and the disorder predictor regions:

ID, region_start, region_end, disorder_type
O43760, 1, 19, "D"
O43760, 46, 54, "D"
O43760, 96, 100, "D"
O43760, 178, 224, "D"
Q92728, 1, 53, "D"

I know how to use the terminal (e.g. downloaded software, awk) fairly well and I also use R, so solutions using those tools would be greatly appreciated, but if you know of another way, that's fine. I have been playing around with jq and jt but haven't yet managed to solve this problem with them.

Any help would be appreciated! Thanks!

modified 5 weeks ago by ieuangw • written 9 weeks ago by Jason2

Can I ask how you got your file in JSON format? Every time I have tried to download it, I can only get the mjson format, which does not load in R.

written 5 weeks ago by ieuangw

I doubt they ever had the files in json format. At least when I wrote the script shown below, the files were in mjson format (see the Usage portion below; the filename has mjson in it).
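If mjson means one JSON object per line (an assumption, though it matches the two example records in the question), that would explain why loading the whole file as a single JSON document fails. A minimal Python sketch of the difference, using made-up one-field records rather than real MobiDB entries:

```python
import json

# Two tiny records, one per line, mimicking the assumed mjson layout.
mjson_text = '{"acc": "O43760"}\n{"acc": "Q92728"}\n'

try:
    # Whole-file parse: the text is not a single JSON document.
    json.loads(mjson_text)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Parsing each non-empty line on its own works.
records = [json.loads(line) for line in mjson_text.splitlines() if line.strip()]
accessions = [record["acc"] for record in records]

print(whole_file_ok, accessions)
```

The same idea should carry over to other tools: read the file line by line and parse each line as its own JSON object.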

written 5 weeks ago by vkkodali
vkkodali wrote:

If you can use Python, its standard-library json module can deal with this. Check out https://docs.python.org/3/library/json.html, specifically the 'Decoding JSON' part.
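As a rough illustration of the decoding step, here is a minimal sketch that parses a single record and pulls out its regions. The record is a shortened version of the Q92728 example from the question (sequence and scores omitted), and the helper name regions_for is made up for this example:

```python
import json

# Shortened record in the same shape as the examples in the question
# (sequence and scores left out for brevity).
record_text = '''{"acc": "Q92728", "mobidb_consensus": {"disorder": {"predictors": [
  {"regions": [[1, 53, "D"]], "method": "simple"},
  {"regions": [[1, 53, "D_WC"]], "dc": 1, "method": "mobidb-lite"}]}}}'''

# json.loads turns the JSON text into nested dicts and lists.
record = json.loads(record_text)


def regions_for(record, method="simple"):
    """Return (acc, start, end, type) tuples for one predictor method."""
    predictors = record["mobidb_consensus"]["disorder"]["predictors"]
    return [
        (record["acc"], start, end, kind)
        for predictor in predictors
        if predictor.get("method") == method
        for start, end, kind in predictor.get("regions", [])
    ]


print(regions_for(record))
```

The default of "simple" is a guess based on the desired table in the question, which lists the "D" regions rather than the "D_WC" ones.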

You can use the quick-and-dirty script shown below:

Usage:

./disorder_to_tbl.py disorder_UP000005640.mjson.gz > output_table.tsv

At least for the human file you have pointed to, I did not encounter any errors.
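A minimal sketch of what such a script could look like (not necessarily the one originally posted). It assumes the .mjson.gz file holds one JSON record per line, as in the two example records in the question, and that the regions wanted are those of the "simple" predictor, which is what the desired table in the question lists. The function names iter_rows and main are illustrative:

```python
#!/usr/bin/env python3
"""Sketch: flatten MobiDB disorder regions into a tab-separated table."""
import gzip
import json
import sys


def iter_rows(lines, method="simple"):
    """Yield (acc, start, end, type) for each region of the chosen predictor."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumption: one JSON object per line
        predictors = (
            record.get("mobidb_consensus", {})
            .get("disorder", {})
            .get("predictors", [])
        )
        for predictor in predictors:
            if predictor.get("method") != method:
                continue
            for start, end, kind in predictor.get("regions", []):
                yield record["acc"], start, end, kind


def main(path, out=sys.stdout):
    """Write a header plus one tab-separated row per region to `out`."""
    # Text mode lets us iterate over the (optionally gzipped) file line by
    # line, so the whole proteome never has to sit in memory at once.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        out.write("ID\tregion_start\tregion_end\tdisorder_type\n")
        for row in iter_rows(handle):
            out.write("\t".join(map(str, row)) + "\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved as disorder_to_tbl.py and made executable, it would be called as in the Usage line above. Because it loops over every record in the file, no accession needs to be hard-coded.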

modified 9 weeks ago • written 9 weeks ago by vkkodali

Thanks! However, I really need to do this for the whole proteome, so hard-coding each protein would be difficult. Is there a way to loop over each protein in a high-throughput manner?

written 9 weeks ago by Jason2

I updated my answer to change it to a script that you can use.

written 9 weeks ago by vkkodali