AlphaFold online availability and use case
12 months ago

I'm new to both protein structure prediction and the use of AI-based tools like AlphaFold2 or RoseTTAFold, and I have a few questions:

1. Is it possible to use AlphaFold2 structure predictions to validate HMMER-based domain sequence predictions? If yes, what would the steps be? I have some ideas but am not sure they will work, so I would welcome your input and advice.

2. For predicting structure of a protein domain sequence, should I feed the software

• the entire protein sequences, or

• just the domain sequence, or

• domain sequence + some bordering additional aa residues?

3. For AlphaFold2, are there currently any ready-to-use virtual machines that I can launch as an end user, rather than fumbling with the installation myself?

4. Is there a cloud-based server that provides not just approximate predictions, but the results I would obtain from the full install? For example, AFAIK some of the free-to-use Google Colab notebooks are not full implementations: they use a shortcut for the computationally intensive MSA step. Wouldn't those predictions be less accurate?

5. Would the prediction pipeline you suggest also be practical for high-throughput structure prediction to validate my domain predictions, where I can automate submission of ~1500 sequences (either full-length protein or domain-only sequences)? Is it even possible to automate submission to some of the free-to-use Google Colab notebooks (from Martin Steinegger's group)?

alphafold structure prediction • 2.9k views

Not particularly my area of expertise, but:

1. I don't think you can use a structure prediction tool to really 'validate' HMMER predictions. I'm pretty sure most structure predictors rely on HMMER or similar HMM-based approaches (Martin told me AlphaFold leans on HHblits API calls, for example). I would argue that the HMMs are probably 'closer to the raw data', so really you'd be validating the structure prediction with HMMs rather than the other way around - though this is a fairly nitpicky point (and all the data in concert is never a bad thing).

2. Generally the whole sequence, but there's no one-size-fits-all answer. If it's a very large protein, or has domains that are not well characterised, then you may get better results feeding in separate domains. The predictor I-TASSER, for instance, actually has a hardcoded limit of 1500 AAs, so if your protein is too large, you'd have no choice but to break it up. In general, the longer the sequence you submit, the lower the accuracy of the resulting model is likely to be.

3. Not really sure on this, so others may weigh in. I believe the GitHub page offers a Docker image with AlphaFold ready to run. I have a vague recollection of a web server being available, but I don't know for certain offhand.

4. I'm not aware of 'shortcuts' per se (but again, not my area of expertise). As mentioned, I know part of AlphaFold calls out to the HH-suite set of tools. I don't know that this would be any less accurate though, as HH-suite is a very good set of HMM and alignment tools. I can't imagine you could hope for much better alignment/templating with other approaches or tools.

5. Protein structure prediction remains a very computationally costly task, so the only real way you'd ever obtain structures for 1500 sequences in a short space of time would be to run the software locally on infrastructure you control. I'm aware of Martin's Colab setup, but I'm afraid I don't know whether you could easily mass-auto-submit jobs.


Joe, thanks a TON for your point-by-point replies. Please find my comments and thoughts below.

1. HMMER3-based searches (using hmmsearch or pfamscan) yield sequence matches with varying E-values, bit scores and lengths. Some of these sequence predictions may not be 'true positives'. I therefore started thinking of using structure predictions to classify my HMMER3-based sequence predictions into 'false positive', 'true positive' and 'uncertain' categories.

2. I have done 3D superposition of the available PDB structures for this protein domain, and they overlap well even at < ~20% pairwise identity. This is not surprising, and only confirms that structure is better conserved than sequence. So my idea is to set conditions such as these: if more than half of the 3 helices in the structure are missing, AND/OR the contact residues that allow physical interaction with the protein partner(s) are missing, I would classify the hit as 'uncertain' or 'false positive'. This idea is still amorphous and evolving. What do you think?

3. On the Google Colab notebook that runs a simplified version of AlphaFold2 [or was it RoseTTAFold? I can't remember, but it may be an important distinction], 1 submission took ~10 minutes to return results. This is super quick compared to most other methods... So it's not total computational time that is the challenge, but the impracticality of submitting 1500 entries manually, one by one... Are there any workarounds to this?

Best, Charlotte

1. You could certainly try it - but I'm not sure if this is going to achieve the objective. If your HMM matches have E-values that are sufficiently low (1e-06 is a pretty common default), I wouldn't be hugely worried that they are false positives. If you have hits that have poor E-values, then this would suggest there isn't really much structural information available to match to, and in all likelihood, you'd get poor structure simulations anyway (so you'd just end up comparing junk to junk).

2. Without knowing what the proteins are, I can't really say whether this sounds like a reasonable set of criteria or not. If you have some pre-existing biological insight into what the 'important' features of these proteins are, then sure, you can probably draw up some criteria like that. I would be inclined to use (or at least start with) some more generalist and objective metric of structural similarity, for example TM-score.

3. I'm afraid I don't know the tool well enough to say. I agree with Mensur, though, that if you were batch submitting many jobs, it's considered good 'etiquette' to obtain permission from the server operators first, as you'll be filling up the queue for other users too. It's not something I particularly know how to do, but you can always write scripts that interact with the webpage directly in a kind of hacky way. The better suggestion would be to have a good read of the Colab software's documentation and see whether it mentions an API you can use to submit batch jobs.
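
As a rough illustration of the E-value triage discussed in point 1, here is a minimal Python sketch that bins hits from an `hmmsearch --domtblout` table by their independent E-value. The 1e-06 cutoff is the one mentioned above; the `borderline` cutoff of 1e-02, the bin names and the function name are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List


def bin_hits_by_evalue(domtbl_lines: Iterable[str],
                       confident: float = 1e-6,
                       borderline: float = 1e-2) -> Dict[str, List[str]]:
    """Sort rows of an hmmsearch --domtblout table into three bins.

    Column 13 of the table (index 12) is the per-domain independent E-value.
    The `borderline` cutoff is an illustrative assumption, not a HMMER default.
    """
    bins: Dict[str, List[str]] = {"confident": [], "uncertain": [], "weak": []}
    for line in domtbl_lines:
        # Skip blank lines and the '#' header/footer lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target = fields[0]             # target sequence name
        i_evalue = float(fields[12])   # independent E-value of this domain hit
        if i_evalue <= confident:
            bins["confident"].append(target)
        elif i_evalue <= borderline:
            bins["uncertain"].append(target)
        else:
            bins["weak"].append(target)
    return bins
```

Only the 'uncertain' bin would then be worth the cost of structure prediction; the 'weak' hits are the ones most likely to give the junk-to-junk comparison described above.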

Joe - will look into your suggestions, thank you very much :)

BTW, for pairwise comparison of structures, is TM-score currently the best option? Is it computationally intensive to install and run locally? If yes, are there 'lighter' alternatives for a quick-and-dirty initial analysis of my predicted structures? TIA!


TMscore is a single C++ program that compiles into a single file:

g++ -O3 -o TMscore TMscore.cpp -lm


It takes about a second to run for a pair of structures.
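
For many pairs, the compiled program can be driven from a script. The following is a minimal Python sketch, assuming the binary is at `./TMscore`, that it is called as `TMscore model reference`, and that its report contains a line of the form `TM-score    = 0.1234  (d0= ...)`; the function names are hypothetical and the regular expression should be checked against the output of your own build.

```python
import re
import subprocess
from typing import Optional

# Assumed report line: "TM-score    = 0.8123  (d0= 3.50)"
TM_LINE = re.compile(r"^TM-score\s*=\s*([0-9.]+)", re.MULTILINE)


def parse_tm_score(report: str) -> Optional[float]:
    """Return the TM-score found in a TMscore text report, or None if absent."""
    match = TM_LINE.search(report)
    return float(match.group(1)) if match else None


def run_tmscore(model_pdb: str, reference_pdb: str,
                exe: str = "./TMscore") -> Optional[float]:
    """Run the TMscore binary on one pair of PDB files and parse its report."""
    result = subprocess.run([exe, model_pdb, reference_pdb],
                            capture_output=True, text=True, check=True)
    return parse_tm_score(result.stdout)
```

Calling `run_tmscore("model.pdb", "reference.pdb")` in a loop over file pairs would then give one score per comparison.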


OK, that's super fast. So no software dependencies? It doesn't seem to need other software, based on a quick look at https://zhanglab.dcmb.med.umich.edu/TM-score/ - TIA!


From my (albeit limited) experience, I believe TM-score to be one of the best current metrics (as it overcomes some of the limitations of RMSD). I don't recall it being particularly computationally intensive when I was running it on around 1600 proteins.


Little update: Martin confirmed that, at the moment, direct batch submission isn't possible.

• Thank you Joe for finding this out. I couldn't possibly do my submissions one by one for such a large dataset (> 1000 domain sequence predictions).
• It's easy to identify (for example) the top ~25 most divergent sequence predictions from my set of 1000+ sequences.
• So I'm thinking I'll predict only these 25 structures using AF2.
• Then I'll compare each of these 25 predictions to known PDB structures to see if they superimpose as a true protein domain with a conserved 3D conformation would.

Does that sound reasonable? Do you think there are hurdles to even this sort of a simple(r) approach?

PS. apologies for my delayed reply, I think Biostars email notification went to spam!


Are you sure you really need to model all 1000? Your approach of reducing the dataset through doing some clustering or something sounds sensible.

Do you have access to any local HPC or workstations you can run simulations on locally? During my PhD I ran simulations on many hundreds of sequences using a local I-TASSER install on a pretty standard server (32 cores, 100 GB RAM).

Which brings me to my next question - do you have to use AlphaFold? It's accurate, sure, but for many proteins for which there is already good experimental structural data, I-TASSER etc. will still perform very well (and may be faster, I don't know).

Otherwise the approach you mention sounds reasonable to me.

• Yeah, I agree with you 100%. Like I mentioned above, I'll just stick with the 10-20 top divergent sequences based on sequence clustering (not the 1000s in the full dataset)
• Predict 3D structure for these "wildly divergent sequences" using I-TASSER / AlphaFold / RoseTTAFold or whichever tool makes sense given my time / computational constraints - yes, I do have university HPC access, but not at a level where I can install and run AF2! :)
• Check to see if 3D overlaps of these predictions suggest whether they belong to the same type of fold or not, by comparing predictions to 1 or more available PDBs
• Use structure predictions to classify my predicted sequences as high vs. low chance of being true positives

So with your help and that of others in this post, my analysis pipeline and strategy are decided, thank you all v. much :)
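
The last step of the plan above (turning structure comparisons into a high vs. low chance of being a true positive) could look something like the Python sketch below. It assumes each predicted structure has already been scored against one or more reference PDB structures with TM-score. The 0.5 cutoff follows the commonly cited rule of thumb that TM-scores above 0.5 usually indicate the same fold; the 0.3 lower cutoff, the labels and the function name are assumptions for illustration.

```python
from typing import Dict, List


def classify_hits(tm_scores: Dict[str, List[float]],
                  same_fold: float = 0.5,
                  unrelated: float = 0.3) -> Dict[str, str]:
    """Label each sequence from its best TM-score against the reference structures.

    `tm_scores` maps a sequence ID to its TM-scores against each reference.
    Both cutoffs are illustrative and should be tuned on known family members.
    """
    labels: Dict[str, str] = {}
    for seq_id, scores in tm_scores.items():
        if not scores:
            # No comparison was made, so nothing can be concluded.
            labels[seq_id] = "uncertain"
            continue
        best = max(scores)  # best match over all reference structures
        if best >= same_fold:
            labels[seq_id] = "likely true positive"
        elif best < unrelated:
            labels[seq_id] = "likely false positive"
        else:
            labels[seq_id] = "uncertain"
    return labels
```

A low score can also come from a poor model rather than a wrong domain assignment, so the 'likely false positive' label is best read as 'needs a closer look'.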


You mean notebooks like https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb and https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb ? According to their own Colab page: "While accuracy will be near-identical to the full AlphaFold system on many targets, a small fraction have a large drop in accuracy due to the smaller MSA and lack of templates". It looks like this is one of the best possible alternatives to downloading and installing AlphaFold2 on a local system, as the databases alone take up about 2.2 TB of space.

12 months ago
Mensur Dlakic ★ 20k
1. There is no need for heavy-duty methods such as AlphaFold2 (AF2) in all cases. It is very unlikely that you have 1500 sequences that only have domains of unknown function, and even if you do, there were successful structure prediction servers in existence before AF2. So even though what you want could be done, it is most likely overkill. I don't know if you are aware, but for most Pfam domains, structures are already available on the Pfam website, and for many sequences within each group. For example, here are AF2 models for one of my favorite domains.

2. Tough to generalize. For domains that are completely independent folding units, it would be enough to feed in only the domain sequence. I have done this for a very small domain (< 50 residues), and AF2 folds it identically when submitted independently or as part of the whole sequence. However, for domains that make some contacts with the rest of the protein, the full sequence may be required. I think it is safe to start with just a domain sequence, or at most include 5-10 residues on each side. The latter suggestion is not because AF2 may need it, but because domain assignments are often off. Beware that this way you may end up with floppy tails at each end of your sequence.

3. None that I know of. There is a Docker container, but it requires more disk space than they advertise, and downloading all the PDB structures using their scripts took more than 2 days. Their estimate is 2.2 TB to download everything initially, but it takes at least 2-3x that much to install the Docker container (and lots of patience). So you can download everything on one disk and install the container on another, but you will still need a big and reasonably fast disk. Eventually I installed it without Docker as described here, but that took some fiddling and likely does not fit your definition of a no-fuss installation. It works nicely and is very fast for my needs, though one could say that I have resources closer to the high end (a GPU and 40 CPUs).

4. I don't know beyond what is listed on the AF2 GitHub page. Something will probably be available shortly (within one funding cycle, so let's say 6-8 months). You can get in the queue for RoseTTAFold, and that takes about a week.

5. Even if it is possible, I would never submit 1500 sequences to a server without contacting the group that maintains it. Color me cynical, but I think they would not approve such a request. I think this would be partly because of fair resource use, and also because it is most likely overkill, as I mentioned in my first point.
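
To illustrate point 2, here is a minimal Python sketch that cuts a domain out of a full-length sequence and pads it with up to 10 residues on each side. It assumes 1-based, inclusive domain coordinates (as reported by HMMER or Pfam); the function name and the default flank size are illustrative choices.

```python
def domain_with_flanks(sequence: str, start: int, end: int, flank: int = 10) -> str:
    """Return residues start..end (1-based, inclusive) plus up to `flank`
    extra residues on each side, clipped at the ends of the sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("domain coordinates fall outside the sequence")
    left = max(start - 1 - flank, 0)           # 0-based slice start
    right = min(end + flank, len(sequence))    # slice end (exclusive)
    return sequence[left:right]
```

For example, `domain_with_flanks(full_seq, 120, 185, flank=5)` would return residues 115 to 190 of `full_seq`, provided the sequence is at least 190 residues long.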


PS. Many pre-calculated structures for proteins of model organisms are available here.


Mensur Dlakic - Thanks a lot for your point-by-point replies, greatly appreciated!

1. No, I was not aware that AF2 predictions are now added to Pfam under tab9 for each Pfam ID. This is very useful for me to know. Awesome!

2. Are floppy tails inevitable at the ends of structure predictions? Regardless of input sequence length and constraints placed by template structure? And is floppiness increased for template-free predictions? How to minimize / eliminate floppiness?

3. Is uncertainty in domain boundaries unique to or more prevalent in Pfam pHMM + HMMER3 based predictions or equally so in other related methods such as InterProScan etc., or more sensitive methods such as HHblits?

4. Even the Docker install is too much for my limited resources - 2+ TB SSD, 1 GPU, 40 CPUs - at this point, I can only imagine. So I'll look more into the Google Colab notebooks / email the DeepMind AF2 group / email Pfam to see if there's anything they are doing that I can benefit from...

5. Is there a programmatic way to extract only those lines in a PDB file for JUST the Pfam domain coords that I am interested in? Could you please share details? Thanks again.
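
For question 5, one possible approach is to filter the coordinate records by residue number. The Python sketch below relies on the fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26). It assumes the residue numbering in the file matches the domain coordinates, which holds for AlphaFold models numbered from 1 but often not for experimental structures; the function name is hypothetical.

```python
from typing import Iterable, List


def extract_residue_range(pdb_lines: Iterable[str], start: int, end: int,
                          chain: str = "A") -> List[str]:
    """Keep ATOM/HETATM records of one chain whose residue number lies in
    start..end (inclusive). All other record types are dropped."""
    kept: List[str] = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        if line[21] != chain:          # column 22: chain identifier
            continue
        res_num = int(line[22:26])     # columns 23-26: residue sequence number
        if start <= res_num <= end:
            kept.append(line)
    return kept
```

Reading a file with `open("model.pdb").readlines()`, passing the lines through this function and writing the result back out gives a domain-only PDB file.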


The easiest thing by far is to find out if your sequences of interest have already been modeled. If you know the UniProt (Swiss-Prot/TrEMBL) accession numbers for your sequences, you can simply enter them at https://alphafold.ebi.ac.uk/ For example, here is an AlphaFold model for the sequence P11369. There is a very good chance that at least some of your sequences already have models, so the only thing you need to do is find their accession numbers. There is an even greater chance that you don't need to model most of your sequences, since many of them probably have the same domains (and thus the same folds). If that is the case, I'd focus on clustering the sequences and modeling only one representative from each group.
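
For a long list of accessions, the lookup links can be generated rather than typed. This is a minimal Python sketch that checks each ID against the UniProt accession pattern and builds an entry-page link. The `/entry/<accession>` URL layout is an assumption based on how the site currently presents entries and may change; the function name is hypothetical.

```python
import re

# UniProt accession pattern, e.g. P11369 or A0A023GPI8.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def alphafold_entry_url(accession: str) -> str:
    """Build the AlphaFold database entry-page URL for one UniProt accession."""
    acc = accession.strip().upper()
    if not UNIPROT_ACCESSION.match(acc):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    # Assumed URL layout: https://alphafold.ebi.ac.uk/entry/<accession>
    return f"https://alphafold.ebi.ac.uk/entry/{acc}"
```

IDs that fail the check are worth mapping to UniProt accessions first, since the database is indexed by those.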

Floppy tails are not inevitable, though they will probably be more likely to occur for template-free models. I would not worry too much about it. If they are not esthetically pleasing, you can always delete those residues from PDB models.
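
Deleting the tails can also be scripted. AlphaFold models store the per-residue confidence (pLDDT) in the B-factor column of the PDB file, so one option is to trim the terminal residues whose pLDDT is low. The Python sketch below assumes a single-chain AlphaFold model and treats 'floppy' as pLDDT below 50; both the cutoff and the function name are illustrative assumptions.

```python
from typing import Dict, Iterable, List


def trim_low_confidence_tails(pdb_lines: Iterable[str],
                              min_plddt: float = 50.0) -> List[str]:
    """Drop N- and C-terminal residues with pLDDT below `min_plddt`.

    Internal low-confidence residues are kept; only the tails are trimmed.
    Assumes a single-chain model with pLDDT in the B-factor column.
    """
    atoms = [l for l in pdb_lines if l.startswith("ATOM") and len(l) >= 66]
    plddt: Dict[int, float] = {}
    for line in atoms:
        # Columns 23-26: residue number; columns 61-66: B-factor (pLDDT here).
        plddt.setdefault(int(line[22:26]), float(line[60:66]))
    confident = [res for res, score in plddt.items() if score >= min_plddt]
    if not confident:
        return []
    first, last = min(confident), max(confident)
    return [l for l in atoms if first <= int(l[22:26]) <= last]
```

Because only the ends are removed, a low-confidence loop in the middle of the domain stays in place and the chain remains continuous.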

When domains are determined manually - when a person is inspecting and adjusting the alignments - there is usually no problem with domain boundaries. In many instances, however, domains are assigned automatically by programs that trim the alignments, and they are not always precise.

I don't know the answer to 5.

And Emily, people who answer questions here do so as volunteers. Calling on someone to answer a question - even when done politely - is not the way to go. It signifies entitlement, and nobody here owes answers to others. The reason I didn't answer your second query is that I'd be repeating myself (I already told you about accessing existing AlphaFold models) or that I didn't think my answers would be particularly productive (floppy-tail modeling). I think it is good for all of us who are interested in learning to try some things on our own rather than wait for an answer.


Thank you so very much for your point-by-point replies, once again. I feel highly appreciative of you sharing your time and insights, and not the least bit entitled; I am very sorry that you think so.