I am trying to run a ~800-residue protein (WP_086691035.1 colicin uptake protein [Nostoc sp. T09]) through the AlphaFold2 IPython notebook on Colab.
However, I am running out of memory. Shorter sequences work fine, and I remember running a sequence of similar length earlier. Do you know if there is a way of getting a successful run, for example when the load is lower, such as during the weekend? I know there is a paid plan, but I would only use it for running the notebook, and it is not clear how many resources each plan actually provides. It just says that the paid plan has "more" CPU and RAM, which is not that helpful.
running model_1
---------------------------------------------------------------------------
UnfilteredStackTrace Traceback (most recent call last)
<ipython-input-13-af48741e914e> in <module>()
50 model_params=model_params, use_model=use_model,
---> 51 do_relax=use_amber)
13 frames
UnfilteredStackTrace: RuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 10409975032 bytes.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(compiled, avals, handlers, kept_var_idx, *args)
958 for i, x in enumerate(args)
959 if x is not token and i in kept_var_idx))
--> 960 out_bufs = compiled.execute(input_bufs)
961 check_special(xla_call_p.name, out_bufs)
962 return [handler(*bs) for handler, bs in zip(handlers, _partition_outputs(avals, out_bufs))]
RuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 10409975032 bytes.
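For scale, the single allocation that failed in the traceback above is roughly 9.7 GiB. A minimal sketch of the conversion follows; the 16 GiB card size is only an assumed example and is not a figure reported by Colab in this thread.

```python
# Size of the failed allocation, copied from the RESOURCE_EXHAUSTED message.
failed_alloc_bytes = 10409975032

BYTES_PER_GIB = 2**30
failed_alloc_gib = failed_alloc_bytes / BYTES_PER_GIB
print(f"Failed allocation: {failed_alloc_gib:.2f} GiB")

# ASSUMPTION: a hypothetical GPU with 16 GiB of memory, for illustration only.
assumed_gpu_gib = 16
fraction_of_card = failed_alloc_gib / assumed_gpu_gib
print(f"That one buffer alone would be {fraction_of_card:.0%} of a {assumed_gpu_gib} GiB card")
```

This is only one buffer; the model holds other arrays at the same time, so the real requirement is higher than this number.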
This is the domain distribution in your protein according to Pfam:
It seems that all the relevant domains are approximately in the 50-700 range, which you should be able to fold on Colab.
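If you want to try that, here is a minimal sketch of cutting the sequence down to the 50-700 range before pasting it into the notebook. The function name and the placeholder sequence are made up for illustration; substitute the real WP_086691035.1 sequence.

```python
def trim_sequence(sequence: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a protein sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"range {start}-{end} is outside 1-{len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]

# PLACEHOLDER: an 800-residue dummy sequence, not the real protein.
full_length = "M" + "A" * 799
trimmed = trim_sequence(full_length, 50, 700)
print(len(trimmed))  # 651 residues (700 - 50 + 1)
```

The boundaries are taken from the approximate range mentioned above, so it may be worth adding some padding on either side rather than cutting exactly at a domain edge.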
Thanks again for your quick reply, I will look into breaking the sequence up. The point is that I am trying to show that this protein is in fact an ortholog of something else (MVP), so it would be best to model the full sequence if at all possible. I will also use I-TASSER to compare the models.
This protein is definitely a relative of MVP, but I don't know whether it can be proven to be an ortholog, with or without a structure. If you can model 80% of the protein, you will likely have the same argument one way or the other as you would have with the whole protein.
Here, I randomly picked an MVP from T. cruzi and you can see all the details at this link. There is already an AlphaFold structure for that protein and for many similar ones, and its domain structure is nearly identical to your protein's. Like I said, they are definitely related, though it is up to you to figure out the orthology.
Thank you. If possible, I would like to avoid breaking the protein up, because I don't know the exact breakpoint, and if I get it wrong, the prediction might be way off. But maybe a slightly stupid question: if I copy the source code from the notebook and run it as a Python script on our server, would that work? Or is there something that only works on Colab?
It works just fine locally, but their estimate of the required disk space is way off if you plan to run it as a Docker container, which is what their GitHub suggests. They estimate 2.2 TB of disk space, while the Docker container will need at least double that, if not more. Also, it will take a couple of days to download everything because their scripts are quite inefficient.
I suggest running it as a non-Docker setup, which is described here. That one takes a single download and unpacking, and doesn't have all the Docker baggage, so the requirement is close to 2.5 TB of disk space.
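Before starting the download, it may be worth checking that the target volume really has that much room. A minimal sketch using only the standard library; the 2.5 TB threshold is the rough figure quoted above, and the directory is a placeholder.

```python
import shutil

REQUIRED_TB = 2.5        # rough figure for the non-Docker setup, from this thread
BYTES_PER_TB = 10**12    # decimal terabytes, as disk vendors count them

def has_enough_space(path: str, required_tb: float = REQUIRED_TB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_tb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * BYTES_PER_TB

# PLACEHOLDER: point this at the directory where the databases will live.
target_dir = "."
free_tb = shutil.disk_usage(target_dir).free / BYTES_PER_TB
print(f"{free_tb:.2f} TB free, enough: {has_enough_space(target_dir)}")
```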
It seems I have better success with the advanced notebook AlphaFold2_advanced.ipynb. It has already computed 3 of 5 models, but maybe I have just been lucky.
Anyway, I will try to buy a 3 TB scratch volume to try to install it myself. I am guessing an SSD will be best?
If you are doing this on a large scale and speed is important, an SSD is definitely the way to go. Beware that the 2.2 TB requirement may not be accurate when using Docker - or maybe I don't know how to set up a Docker container without needing extra disk space.
Be sure to use aria2 for downloads, and even that won't help much when downloading PDB files. That part took about two days.
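For reference, a minimal sketch of how an aria2 download could be driven from Python. The URL and target directory are placeholders, not the real database locations, and the connection count is just an example value; the sketch only builds and prints the command.

```python
import shlex

def aria2_command(url: str, dest_dir: str, connections: int = 8) -> list:
    """Build an aria2c command that resumes partial files and splits the download."""
    return [
        "aria2c",
        f"--dir={dest_dir}",                            # where to save the file
        "--continue=true",                              # resume a partial download
        f"--max-connection-per-server={connections}",   # parallel connections per host
        f"--split={connections}",                       # number of pieces per file
        url,
    ]

# PLACEHOLDERS: substitute the real database URL and your own data directory.
cmd = aria2_command("https://example.org/database.tar.gz", "/data/alphafold_db")
print(shlex.join(cmd))
```

To actually start the download, pass the list to `subprocess.run(cmd, check=True)` on a machine where aria2c is installed.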
I run AlphaFold2 from a regular hard disk, with an NVIDIA GeForce GTX 1080 (8 GB). It takes about 90 minutes for a 400 aa protein.
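On a card with this little memory, longer sequences may not fit on the GPU. AlphaFold's Docker launcher sets two environment variables that let JAX spill over into host RAM; below is a minimal sketch of setting them by hand in a non-Docker run. Whether this is enough for an ~800-residue protein on a given machine is an assumption I have not tested, and it will be slower than keeping everything on the GPU.

```python
import os

# These must be set before JAX is imported, otherwise they have no effect.
# Allow the XLA runtime to use unified (GPU + host) memory.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
# Let JAX address up to 4x the GPU memory; the excess is backed by host RAM.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"

unified_memory_settings = {
    name: os.environ[name]
    for name in ("TF_FORCE_UNIFIED_MEMORY", "XLA_PYTHON_CLIENT_MEM_FRACTION")
}
print(unified_memory_settings)
```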