Time For a New Adventure
Written by: Pat Walters, PhD
Source: https://patwalters.github.io/Time-For-a-New-Adventure/
Practical Cheminformatics, Pat's legendary blog, has been a stage for the community. For years, it has helped shape the way our field thinks about data and discovery, nudging it toward transparency and excellence. As Pat begins his new chapter as Chief Scientist at OpenADMET, we’re proud to repost his latest here. We are happy to have you, Pat!

Starting today, September 15, 2025, I will assume a new role as Chief Scientist at OpenADMET, an open science initiative that combines high-throughput experimentation, computation, and structural biology to enhance the understanding and prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET). After spending the last twenty-five years working on open science as a “hobby,” I am very excited to pursue it as my full-time career.
I have spent the past thirty years working on a wide range of drug discovery projects. During that time, I have noticed that potency optimization is rarely the primary cause of delays in drug discovery. Instead, teams usually focus most of their efforts on improving pharmacokinetics and reducing off-target interactions that could cause adverse side effects. They encounter challenges managing interactions with cytochrome P450 enzymes (which can lead to drug-drug interactions), hERG inhibition (which can disrupt normal heart function), and other unintended effects. Unlike potency optimization, where tools like free energy perturbation (FEP) guide decisions, ADMET optimization often depends on heuristics based on hard-earned experience.
Mark Murcko and James Fraser introduced the term “avoidome” to describe targets that our drug candidates should avoid. For more background on the avoidome and the reasons behind OpenADMET, I recommend watching Mark’s recent talk at Drug Hunter. Essentially, OpenADMET tackles the avoidome by combining three components: targeted data generation, structural insights from X-ray crystallography and cryo-EM, and machine learning. This combined approach allows us to better understand the factors that influence interactions with avoidome targets. By gaining this knowledge, we can develop reusable strategies to steer clear of these targets and break the frustrating cycle of “whack-a-mole” (where progress is often undone by unexpected setbacks) that frequently occurs in drug discovery projects.
I strongly believe that close interaction between computation and experimentation is crucial for advancing drug discovery. This is one of the main reasons I am enthusiastic about OpenADMET. We can not only generate data and build models but also develop structural insights to help interpret that data. For example, consider hERG inhibition. As mentioned earlier, hERG inhibition can cause potentially fatal disruptions in normal cardiac rhythm. The literature contains examples where medicinal chemistry teams have combined intuition and brute-force synthesis to reduce hERG liability. Instead of just collecting data to build a model, we can collaborate with our colleagues at UCSF to determine experimental protein-ligand structures and understand the relationship between chemical structure and hERG binding. Additionally, with team members at OctantBio, we can synthesize and test additional analogs to further explore what drives activity, an informative exercise usually reserved for on-target activity. This collaboration among assays, ML models, and structures will help us better understand outliers and cases where our models fall short. We are fortunate to have initial funding that we believe will enable us to test these ideas and accelerate the development of better ML models.
Machine Learning to the Rescue?
Over the past decade, machine learning (ML) models have become vital in modern drug discovery. However, their success depends heavily on large, high-quality datasets. An ML model relies on three key elements: high-quality training data; the representation, which converts a chemical structure into a vector the model can use; and the algorithm that learns the relationship between the representation and the measured data. These elements matter in that order: data is the most important, followed by representation, with algorithms providing smaller, incremental improvements. Sadly, the field has often focused on the wrong aspects. Much attention has gone to algorithms such as neural networks, only to find that their gains over simpler methods, such as decision tree ensembles, are limited given current dataset sizes and quality. Until recently, there were few initiatives dedicated to generating experimental data specifically to improve ML model development. High-quality experimental data, like that from OpenADMET, can be the foundation for better molecular representations and ML algorithms, strengthening the entire process.
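The three elements can be made concrete with a deliberately tiny sketch in pure Python. Everything here is invented for illustration: the "representation" is a crude character-count vector (real work would use fingerprints from a toolkit like RDKit), and the "algorithm" is a one-nearest-neighbor lookup.

```python
# Toy illustration of the three elements of an ML model:
# data, representation, and algorithm. All values are made up.

# 1. Data: (SMILES, measured property) pairs -- hypothetical numbers.
data = [
    ("CCO", 0.2),
    ("CCN", 0.4),
    ("c1ccccc1", 1.5),
]

# 2. Representation: turn a structure into a fixed-length vector.
#    Real work uses fingerprints (e.g., Morgan fingerprints); this
#    toy just counts a few characters in the SMILES string.
def represent(smiles):
    return [smiles.count(ch) for ch in "CNOc1"]

# 3. Algorithm: relate representation to data. Here, one-nearest
#    neighbor by squared Euclidean distance -- simple on purpose.
def predict(smiles):
    x = represent(smiles)
    def dist(pair):
        y = represent(pair[0])
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return min(data, key=dist)[1]

print(predict("CCO"))  # -> 0.2 (its own training label)
```

The point of the decomposition is that each element can be improved independently, and in practice the data is where the leverage lies.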
Most of the literature datasets currently used to train and validate ML models were curated, sometimes inaccurately, from dozens of publications, each of which conducted its experiments differently. A recent paper by Greg Landrum and Sereina Riniker compared cases where the same compounds were tested in the “same” assay by different groups across papers. When comparing IC50 values, they found almost no correlation between the reported values from different papers. Although Landrum and Riniker didn’t compare ADMET assays, the same lack of correlation exists there as well. Instead of relying on low-quality literature data, we need consistently generated data from relevant assays with compounds similar to those synthesized in drug discovery projects. With these datasets in hand, the field can pursue new advances in molecular representation and algorithms.
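The kind of comparison described above can be sketched in a few lines of pure Python. The pIC50 values below are invented for illustration, not taken from any paper: two "labs" report measurements for the same five compounds, and we compute the Pearson correlation between them.

```python
import math

# Hypothetical pIC50 values for the same five compounds as reported
# by two different papers; all numbers are invented for illustration.
lab_a = [6.1, 7.3, 5.8, 8.0, 6.9]
lab_b = [6.5, 6.2, 7.9, 6.0, 7.1]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(lab_a, lab_b)
print(f"Pearson r between labs: {r:.2f}")
```

When values like these disagree, a model trained on the pooled data is learning as much about inter-lab variability as about chemistry, which is exactly why consistently generated data matters.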
The Proof is in the Pudding
While high-quality datasets provide a solid foundation for ML models, questions still remain about how to practically split these datasets for testing model performance. Ultimately, our models should be evaluated prospectively on compounds the model has not previously seen. One of the most effective methods for such prospective testing is through blind challenges, where teams receive a dataset and are asked to submit predictions, which are then compared to ground truth data. One can argue that the recent Nobel Prize-winning success of protein structure prediction methods, such as AlphaFold and RoseTTAFold, would not have been possible without the biennial Critical Assessment of Protein Structure Prediction (CASP) challenges.
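Short of a true blind challenge, one common way to approximate prospective evaluation is a temporal split: sort compounds by assay date and hold out the most recent ones, so the test set mimics compounds the model could not have seen. A minimal sketch, with invented compound IDs, dates, and values:

```python
# Temporal split: train on older measurements, test on newer ones.
# Records are (compound_id, assay_date, value); all data is invented.
records = [
    ("CMPD-1", "2024-01-10", 5.2),
    ("CMPD-2", "2024-03-02", 6.8),
    ("CMPD-3", "2024-06-15", 7.1),
    ("CMPD-4", "2024-09-30", 6.0),
    ("CMPD-5", "2025-01-20", 5.5),
]

def temporal_split(recs, test_fraction=0.4):
    """Hold out the most recent test_fraction of records for testing."""
    recs = sorted(recs, key=lambda r: r[1])  # ISO dates sort as strings
    n_test = max(1, int(len(recs) * test_fraction))
    return recs[:-n_test], recs[-n_test:]

train_set, test_set = temporal_split(records)
print([r[0] for r in train_set])  # older compounds
print([r[0] for r in test_set])   # newest compounds, unseen in training
```

Temporal and scaffold-based splits are both imperfect proxies for prospective evaluation, which is why regularly scheduled blind challenges on freshly generated data are so valuable.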
With OpenADMET, we will generate data and use it to regularly host blind challenges. Our aim is not just to publish results but also to share tutorials and hold seminars using challenge datasets to showcase best practices. In collaboration with the teams involved in the ASAP Initiative and Polaris, the OpenADMET team has already organized a blind challenge focused on activity, structure prediction, and ADMET endpoints. More challenges will be announced soon.
Democratizing ADMET Models
So far, I’ve emphasized the data generation capabilities of OpenADMET, but this effort isn’t only for ML enthusiasts. There is also an internal initiative to develop high-quality models and share them with the community. We will create and update models based on data generated in our labs and from other sources. These models will be packaged and made freely accessible. We will also explore the best ways to combine data from multiple sources, update models with new data, and incorporate experimental data to improve model performance. By working openly and incorporating community feedback on the design and performance of the models, we can enhance their scope, accuracy, and usability in real-world applications.
Answering Some Fundamental Questions
Besides supplying data for model training and future assessments, OpenADMET will serve as a platform to tackle some unresolved core issues facing ML in drug discovery.
- Molecular representation - Many of the most successful ML approaches use molecular representations, such as chemical fingerprints, which have been utilized for decades. Although many other methods have been proposed, it has been hard to determine if they genuinely improve performance. The datasets created by OpenADMET and related efforts, like OpenBind, will enable robust prospective and retrospective comparisons of molecular representations. Hopefully, these datasets will also encourage progress that pushes the field beyond simple molecular graphs toward a more generalizable description of chemical structure.
- Defining a model’s applicability domain - The OpenADMET datasets will help us systematically analyze the relationship between training data and a set of compounds whose properties need to be predicted. These datasets can support the community in proposing and assessing methods for identifying where models are likely to succeed and where they might fail.
- Global vs. local models - Although there is ongoing debate about whether global models outperform series-specific, local models, few systematic comparisons between the two methods have been made. OpenADMET will gather diverse datasets, along with data on specific chemical series. These datasets will allow us to evaluate and compare different model-building approaches.
- Multi-task models - In multi-task ML, a model is trained to predict multiple outcomes simultaneously. The literature includes examples where multi-task learning has been beneficial, as well as cases where it has been less successful. OpenADMET is gathering data on various ADMET-related properties, and these datasets should enable the community to test different multi-task strategies.
- Foundation models and fine-tuning - Over the past few years, the field has made significant progress developing foundation models trained on large datasets and fine-tuned for property prediction and related tasks. Unfortunately, most subsequent validation studies were conducted on low-quality datasets and lacked proper statistical validation. The datasets created by OpenADMET will allow the community to compare a wide range of approaches. When combined with recently published guidelines for comparing ML methods, this data should support more robust comparisons that help us identify genuine advances.
- Quantifying Uncertainty - Using data that a machine learning model was trained on, we should be able to estimate our confidence in a prediction. Although many publications have addressed uncertainty estimation, testing these estimates prospectively has been difficult. The regular data releases from OpenADMET should serve as an excellent testbed for new methods of uncertainty quantification.
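One simple, widely used heuristic that touches both the applicability-domain and uncertainty questions above can be sketched in a few lines: flag a query compound as out-of-domain when its Tanimoto similarity to the nearest training-set compound falls below a threshold. The fingerprints here are toy sets of "on bits" (real ones would come from a toolkit such as RDKit), and the 0.35 threshold is an arbitrary illustrative choice.

```python
# Applicability-domain check via nearest-neighbor Tanimoto similarity.
# Fingerprints are toy sets of "on bits"; all values are invented.

train_fps = [
    {1, 4, 7, 9},
    {2, 4, 8, 9, 12},
    {1, 3, 7, 11},
]

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two bit sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def in_domain(query_fp, threshold=0.35):
    """Return (is_in_domain, similarity to nearest training compound)."""
    best = max(tanimoto(query_fp, fp) for fp in train_fps)
    return best >= threshold, best

ok, sim = in_domain({1, 4, 7, 10})
print(ok, round(sim, 2))  # similar to the first training fingerprint
```

Nearest-neighbor similarity is only one of many proposed applicability-domain measures; the value of datasets like those from OpenADMET is that such heuristics can finally be tested prospectively rather than argued about retrospectively.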
In addition to these computational questions, we will also address critical experimental issues such as assay drift and reproducibility. All experimental data will be made publicly available for the community to evaluate and learn from.
We Should Talk
OpenADMET won’t succeed in isolation. Our efforts must be part of a dialogue with the larger community. Over the upcoming weeks and months, I’ll be reaching out to numerous people in industry and academia. If you have thoughts on this very important topic, please share them with me. I’d love to have a conversation.
Acknowledgements
I’d like to thank the members of the OpenADMET team for helpful comments on earlier versions of this post.