Question: How can I parse a json file from MobiDB to retrieve proteome data?
Jason2 (United States) wrote, 8 months ago:

I've downloaded the MobiDB human data in JSON format. It contains protein disorder information for regions of each protein. The issue I'm having is converting the nested JSON structure into a table.

The file can be downloaded here (click on "json" next to human).

Here's an example of two proteins:

{  "acc" : "O43760",  "sequence" : "MESGAYGAAKAGGSFDLRRFLTQPQVVARAVCLVFALIVFSCIYGEGYSNAHESKQMYCVFNRNEDACRYGSAIGVLAFLASAFFLVVDAYFPQISNATDRKYLVIGDLLFSALWTFLWFVGFCFLTNQWAVTNPKDVLVGADSVRAAITFSFFSIFSWGVLASLAYQRYKAGVDDFIQNYVDPTPDPNTAYASYPGASVDNYQQPPFTQNAETTEGYQPPPVY",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 19, "D" ], [ 46, 54, "D" ], [ 96, 100, "D" ], [ 178, 224, "D" ] ], "method" : "simple" }, { "regions" : [ ], "dc" : 0, "method" : "mobidb-lite", "scores" : [ 0.625, 0.75, 0.75, 0.5, 0.625, 0.625, 0.5, 0.375, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125, 0.125, 0.125, 0.125, 0.125, 0.25, 0.375, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.375, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.625, 0.5, 0.5, 0.625, 0.75, 0.75, 0.75, 0.75, 0.875, 0.75, 0.875, 0.625, 0.625, 0.875, 0.875, 0.875, 0.75, 0.75, 0.75 ] } ] } } }
{  "acc" : "Q92728",  "sequence" : "MPPKTPRKTAATAAAAAAEPPAPPPPPPPEEDPEQDSGPEDLPLVRGSITNGR",  "ncbi_taxon_id" : 9606,  "organism" : "Homo sapiens (Human)",  "mobidb_consensus" : {  "disorder" : {  "predictors" : [ { "regions" : [ [ 1, 53, "D" ] ], "method" : "simple" }, { "regions" : [ [ 1, 53, "D_WC" ] ], "dc" : 1, "method" : "mobidb-lite", "scores" : [ 1, 1, 0.875, 1, 1, 0.875, 0.75, 0.875, 0.75, 0.75, 0.75, 0.75, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.875, 1, 1, 1, 1, 1, 1, 1, 0.875, 0.875, 0.875, 0.875, 0.875, 0.75, 0.75, 0.875, 1, 0.875, 0.875, 1, 1, 0.875, 0.875 ] } ] } } }

This type of structure is repeated thousands of times to capture information across many proteins.

I would like to format the data as a simple table, with columns for the acc ID and the disorder predictor regions:

ID, region_start, region_end, disorder_type
O43760, 1, 19, "D"
O43760, 46, 54, "D"
O43760, 96, 100, "D"
O43760, 178, 224, "D"
Q92728, 1, 53, "D"

I know how to use the terminal (e.g. installing downloaded software, using awk) fairly well and I also use R, so solutions using those tools would be greatly appreciated, but if you know of another way, that's fine. I have been playing around with jq and jt but haven't yet succeeded in using them to solve this problem.

Any help would be appreciated! Thanks!


Can I ask how you got your file in JSON format? Every time I have tried to download it, I can only get the mjson format, which does not load in R.

Reply written 7 months ago by ieuangw

I doubt they ever had the files in JSON format. At least when I wrote the script shown below, the files were in mjson format (see the Usage portion below; the filename has mjson in it).

Reply written 7 months ago by vkkodali

vkkodali (United States) wrote, 8 months ago:

If you can use Python, there is a standard-library module called json that can deal with this. Check out https://docs.python.org/3/library/json.html, specifically the "Decoding JSON" part.
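
As a small illustration of the decoding step, the sketch below parses a single record with `json.loads` and walks down to the predictor regions. The record is a shortened version of the Q92728 example from the question (sequence and scores left out), and the variable names are only illustrative.

```python
import json

# Shortened version of the Q92728 record from the question
# (sequence and per-residue scores left out for brevity).
line = (
    '{"acc": "Q92728", "mobidb_consensus": {"disorder": {"predictors": ['
    '{"regions": [[1, 53, "D"]], "method": "simple"}, '
    '{"regions": [[1, 53, "D_WC"]], "dc": 1, "method": "mobidb-lite"}]}}}'
)

record = json.loads(line)  # JSON text -> nested dicts and lists

# Each predictor carries its own list of [start, end, type] regions.
predictors = record["mobidb_consensus"]["disorder"]["predictors"]
for predictor in predictors:
    for start, end, kind in predictor["regions"]:
        print(record["acc"], predictor["method"], start, end, kind, sep="\t")
```

Each protein record decodes to the same nested shape, so the same key path can be reused for every line of the file.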

You can use the quick-and-dirty script shown below:

Usage:

./disorder_to_tbl.py disorder_UP000005640.mjson.gz > output_table.tsv

At least for the human file you have pointed to, I did not encounter any errors.
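
A minimal sketch of what such a conversion can look like, which is not necessarily identical to the script referred to above. It assumes the file holds one JSON object per line (as in the two example records in the question), that it may be gzip-compressed, and that only the regions of the "simple" predictor are wanted. The function names `iter_region_rows` and `write_table` are made up for this illustration.

```python
#!/usr/bin/env python3
import gzip
import json
import sys


def iter_region_rows(lines, method="simple"):
    """Yield (acc, start, end, disorder_type) for each region of one predictor."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed key path, taken from the example records in the question.
        predictors = (
            record.get("mobidb_consensus", {})
            .get("disorder", {})
            .get("predictors", [])
        )
        for predictor in predictors:
            if predictor.get("method") != method:
                continue
            for start, end, kind in predictor.get("regions", []):
                yield record["acc"], start, end, kind


def write_table(path, out=sys.stdout):
    """Stream a line-delimited JSON file (plain or .gz) to a tab-separated table."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        print("ID", "region_start", "region_end", "disorder_type", sep="\t", file=out)
        for row in iter_region_rows(handle):
            print(*row, sep="\t", file=out)


# Only runs when called as a script with exactly one file argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    write_table(sys.argv[1])
```

Because the file is read line by line, the whole proteome never has to be held in memory, and every protein is handled by the same loop. Passing `method="mobidb-lite"` to `iter_region_rows` would select the other predictor seen in the example records.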


Thanks! However, I really need to do this for the whole proteome, so hard-coding each protein would be difficult. Is there a way to loop over each protein in a high-throughput manner?

Reply written 8 months ago by Jason2

I updated my answer to change it to a script that you can use.

Reply written 8 months ago by vkkodali