Welcome to a webinar:
- John Mitchell "Kaggle Competition Review: Novozymes Enzyme Stability Prediction"
- Friday 18 August, 18.00 (CET, i.e. Paris time)
Welcome to the webinar on Friday: John Mitchell, who is an expert in both bioinformatics and machine learning, an experienced Kaggler and one of the top participants of CAFA5, will share his experience of the past Kaggle competition "Novozymes Enzyme Stability Prediction", which is in certain respects similar to the CAFA5 challenge.
Abstract: Kaggle's Novozymes Enzyme Stability Prediction was a challenging competition that rewarded expertise in bioinformatics as much as in Machine Learning. Two issues that made the competition particularly difficult were that the training data was rather different from the test data, and that there was no obvious or easy local validation protocol available. A wide variety of features and models proved valuable, while ensembling was generally considered essential. The nature of the competition lent itself to overfitting, and it was unsurprising that there was a large shake-up between Public LB and Private LB ranks. I discuss the experience gained from this competition and consider how the lessons learned might be applicable to other bioinformatic competitions.
The Zoom link will be available on the Kaggle forum 5 minutes before the start.
The video recording will be posted later on the YouTube channel: https://www.youtube.com/@SciBerloga
Henrik Nielsen, Vineet Thumuluri, José Juan Almagro Armenteros "DeepLoc 2.0: multi-label subcellular localization prediction using protein language models" Thursday 27 July, 19.00 (CET time)
The talk will be based on the recent paper with the same title (Nucleic Acids Res. 2022, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9252801/). The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
Video recordings are available: https://www.youtube.com/c/SciBerloga
Everybody is welcome to a research/educational webinar. The methods discussed might be useful for CAFA5 participants, although the webinar is NOT specific to the CAFA5 challenge. However, the molecular descriptors developed by the authors (see e.g. https://europepmc.org/article/med/29080875) might be used for feature generation. Although that publication is devoted to a different task, some modification for the CAFA5 challenge may be possible; hopefully this will become clearer after the webinar.
Prof. Alexey Lagunin "Sequence-structure based bioinformatics." Monday 17 July, 17.00 (CET time)
The sequence-structure based approach is a new direction in bioinformatics. It represents sequences by their structural formulas and analyzes structure-property relationships using molecular descriptors and machine learning algorithms. Some use cases of this approach will be discussed during the webinar.
About the speaker: Prof. A. Lagunin (https://scholar.google.ru/citations?user=qaSuEUkAAAAJ&hl) has authored numerous publications in the field of bioinformatics.
Trifonova Sofya "HMMER application to CAFA5." Friday 14 July, 17.00 (CET time)
HMMER (http://hmmer.org/) is a sensitive method for finding protein domains (https://en.wikipedia.org/wiki/Protein_domain). The domains of a protein define its functional features, which is why using this method could improve the predictive ability of ML models in the CAFA5 Kaggle competition (https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/overview). For example, one may create 0-1 features indexed by protein domains: a protein either has a particular domain (1) or does not (0). These features can then be added as predictors for the task.
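The 0-1 domain features described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's actual pipeline: it assumes the HMMER results have already been reduced to (protein, domain) pairs, and the function name `build_domain_features` as well as the protein and domain identifiers are placeholders chosen for this example.

```python
def build_domain_features(hits, proteins):
    """Build a 0/1 protein-by-domain matrix.

    hits     -- iterable of (protein_id, domain_id) pairs, assumed to be
                extracted beforehand from a HMMER search (hypothetical input)
    proteins -- ordered list of protein ids; one matrix row per protein
    """
    hits = list(hits)  # allow generators, since we iterate twice
    domains = sorted({domain for _, domain in hits})
    col = {domain: j for j, domain in enumerate(domains)}
    row = {protein: i for i, protein in enumerate(proteins)}

    # Start with all zeros: "protein does not have this domain".
    matrix = [[0] * len(domains) for _ in proteins]
    for protein, domain in hits:
        if protein in row:  # ignore hits for proteins outside our list
            matrix[row[protein]][col[domain]] = 1
    return domains, matrix


# Placeholder data for illustration only.
hits = [("P1", "PF00001"), ("P1", "PF00002"), ("P2", "PF00002")]
proteins = ["P1", "P2", "P3"]

domains, matrix = build_domain_features(hits, proteins)
for protein, features in zip(proteins, matrix):
    print(protein, features)
```

Each row can then be concatenated with other per-protein features (for example, embeddings) before training a model. With many thousands of distinct domains, a sparse matrix representation would likely be preferable to the nested lists used here.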
The Zoom link will be available on the Kaggle forum https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/418295 and the Telegram channel https://t.me/sberlogabig shortly before the start of the talk.
Video recordings: https://www.youtube.com/c/SciBerloga
Marina Pak (Skoltech) "ProDDG (https://ivankovlab.ru/proddg) - database of data on protein stability change upon mutation (ΔΔG) and predictors of mutation effect."
Friday 7 July, 19.00 (CET time)
ProDDG - https://ivankovlab.ru/proddg - is a web-service for developers, assessors and users of tools for predicting the effect of protein mutations. Using ProDDG you can:
- search, filter, analyze and download ddG data
- download ready-to-use ddG datasets for training, testing and assessment
- find overlaps between ΔΔG datasets based on protein sequence identities and compile leakage-free datasets
- discover popular tools for prediction of mutation effect
I will talk about the features of ProDDG and the problems the service solves, demonstrate use cases and share my experience of developing such projects.
The CAFA 5 Protein Function Prediction challenge (Critical Assessment of Functional Annotation) is ongoing on the Kaggle platform (2 months until the end). The task is to predict Gene Ontology terms for proteins based mainly on their amino acid sequences. We will organize a free introductory webinar:
Liza Geraseva "A short review of methods used in CAFA challenges"
Monday 3 July, 18.00 (CET, i.e. Paris time)
Previous webinar & recording:
Andrey Shevtsov "Introduction to CAFA5 Protein function prediction Kaggle competition"
Thursday 22 June, 19.00 (CET, i.e. Paris time)
Video recording of the webinar:
We would be happy to get in contact with those who are interested in the field and could share related experience, as well as with those who want to try "to Kaggle" (we hope to provide some guidance for beginners). Please let us know.