7 min read

The Free Energy of Everything: Benchmarking OpenFE

Written by: Josh Horton, PhD


Who is OpenFE and what do they do?

Open Free Energy (OpenFE) sits at the nexus of academia and industry. We are developing an open-source software ecosystem for alchemical binding free energy calculations used to guide pharmaceutical drug design and discovery. You can read more about these types of calculations in the fantastic best practices paper from Mey et al.

Behind the scenes at the OpenFE project, we have been working on our largest benchmarking and validation effort to date. In a massive collaborative endeavour involving 15 pharma partners, we set out to assess the real-world accuracy of our relative binding free energy (RBFE) protocol. This is all while keeping the CI lights green, developing new features and making releases like openfe-1.4, which you can download now!

Always be Benchmarking

Benchmarks play an important role in the development and testing of any software. This is even more important when that software aims to guide drug hunters in selecting a few drug-like molecules, from a candidate pool of hundreds of similar designs, to be prioritised for synthesis and testing. Any errors made here could delay the discovery of a novel molecule with therapeutic benefit, or cause it to be missed entirely. Hence, we are always benchmarking. This covers minor releases, new protocols, default setting changes, and anything else within the ecosystem.

Typically, benchmarks aim to test the software and protocols on a wide range of scenarios representative of real-world applications, both to assess performance and to help detect regressions. The same is true for binding affinity benchmarks, where protein-ligand systems covering diverse chemistries are taken from the literature or from active drug discovery projects and assembled into highly curated test systems.

The most comprehensive example of this so far is the dataset assembled by Ross et al., which has been used to extensively benchmark the commercial FEP+ software. While there are a few issues with such datasets (which you can read more about in a great blog post by Ariana Brenner Clerkin), it does allow a direct comparison against a state-of-the-art protocol. Because Schrödinger uses it, it answers the question most of our partners want answered (“how does this perform compared to commercial software?”), so that's the dataset we decided to use.

The Benchmark

Now let's dive straight into the results of the public dataset: that's 59 protein-ligand systems with 876 unique ligands, almost 1,200 transformations and over 7,000 calculations run using OpenFE! The figure below distils this grand undertaking by 15 pharma partners and several thousand GPU hours into a single scatter plot. The plot shows the calculated vs experimental binding affinity of each ligand, with the colour of each point indicating the error of the prediction (from low error in blue to high in red) and shaded regions marking the 0.5 and 1 kcal/mol error boundaries.

Initial comparisons of the experimental and calculated binding affinities using the OpenFE protocol identified an outlier in the PFKFB3 system due to an atom mapping not compatible with the protocol.
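If you want to build this kind of parity plot for your own results, a minimal matplotlib sketch might look like the following; the affinity values here are hypothetical placeholders rather than the benchmark data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholder affinities (kcal/mol); substitute your own results.
experimental = np.array([-9.1, -8.4, -7.9, -10.2, -8.8])
calculated = np.array([-8.7, -9.0, -7.2, -10.9, -8.6])
error = np.abs(calculated - experimental)

fig, ax = plt.subplots(figsize=(5, 5))

# Shade the 1.0 and 0.5 kcal/mol agreement regions around the diagonal.
lims = np.array([min(experimental.min(), calculated.min()) - 1,
                 max(experimental.max(), calculated.max()) + 1])
ax.fill_between(lims, lims - 1.0, lims + 1.0, color="grey", alpha=0.15)
ax.fill_between(lims, lims - 0.5, lims + 0.5, color="grey", alpha=0.30)
ax.plot(lims, lims, color="black", lw=0.8)

# Colour each point by its absolute error (blue = low, red = high).
points = ax.scatter(experimental, calculated, c=error, cmap="coolwarm")
fig.colorbar(points, ax=ax, label="absolute error (kcal/mol)")

ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel("experimental ΔG (kcal/mol)")
ax.set_ylabel("calculated ΔG (kcal/mol)")
plt.show()
```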

Now, you may have noticed that something seems off: we have one outlier prediction with an error of over 28 kcal/mol. On closer inspection of the transformation, we noticed that the atom mapping, which controls how one ligand is transformed into another, was breaking and creating bonds. Our RBFE protocol is not designed to handle this, resulting in unstable and inaccurate simulations; the industry partner running this set reported having to restart the calculation many times to get a result. Thanks to their determination, however, we have identified and fixed this bug in Kartograf, our 3D atom mapper. This bugfix highlights that the benchmarking is already helping to improve the reliability and accuracy of the OpenFE tooling, and we are only scratching the surface of the mountain of data we have collected.
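For context, generating and inspecting an atom mapping with Kartograf looks roughly like the sketch below. The file names are placeholders and the exact keyword arguments may vary between releases, so treat this as an illustration rather than a prescription.

```python
from openfe import SmallMoleculeComponent
from kartograf import KartografAtomMapper

# Placeholder SDF files for the two ligands in a transformation.
ligand_a = SmallMoleculeComponent.from_sdf_file("ligand_a.sdf")
ligand_b = SmallMoleculeComponent.from_sdf_file("ligand_b.sdf")

# Kartograf proposes 3D geometry-based atom mappings between the pair.
mapper = KartografAtomMapper()
mapping = next(mapper.suggest_mappings(ligand_a, ligand_b))

# The mapping is a dictionary of atom indices in ligand A to atom indices
# in ligand B; atoms left out of the mapping are alchemically transformed.
# Inspecting it is a quick sanity check before running an RBFE calculation.
print(mapping.componentA_to_componentB)
```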

Now let's try that again. After rerunning the two edges affected by this issue, the results are shown below in the same style, now alongside the published FEP+ results of Ross et al. and some statistical metrics to aid the comparison.

FEP+ shows lower overall error than OpenFE when comparing experimental and calculated binding affinities across the 59 public datasets. OpenFE (left) has more high-error predictions and larger error bars, as its uncertainties are reported as standard deviations from triplicate simulations, whereas FEP+ (right) reports cycle closure errors.

It’s clear from the plots that, overall, the FEP+ protocol performs better than OpenFE, with OpenFE showing more dark-red, high-error predictions and outliers. There is still work to be done to increase the accuracy of our RBFE protocol.

Despite this, we think these initial results are promising: we used our default Protocol settings throughout the benchmark, and these may not be best suited to every scenario. The FEP+ settings, by contrast, were manually adjusted to maximise performance for each individual system.
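To give a flavour of what running with defaults means in practice, this is roughly how the RBFE protocol is constructed from its default settings in OpenFE; any per-system tuning would involve editing fields on the settings object first, and the exact attribute names should be checked against the release you are using.

```python
from openfe.protocols.openmm_rfe import RelativeHybridTopologyProtocol

# Load the default settings used throughout the benchmark.
settings = RelativeHybridTopologyProtocol.default_settings()

# The settings object is a nested, validated collection of options
# (sampling lengths, force field choices, number of repeats, ...);
# printing it is an easy way to see exactly what the defaults are.
print(settings)

# Build the protocol with those defaults.
protocol = RelativeHybridTopologyProtocol(settings=settings)
```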

Perhaps the most important aspect of calculating the binding affinities of a series of ligands is the ability to correctly order them by potency so that the best compounds can be identified. One nice metric for assessing this is the “fraction of best” ligands metric proposed by Christopher Bayly at the free energy workshop in 2024. Since then it has been gaining traction in the field, and below we show boxplots comparing the per-system distributions of this and other metrics for the OpenFE and FEP+ protocols.

Despite higher overall errors, OpenFE ranks compounds competitively with state-of-the-art commercial software. Comparison of error and correlation statistics for the 59 public datasets between the OpenFE and FEP+ (Ross et al.) results. Shown are box plots of the relative absolute error (RAE), root mean squared error (RMSE), coefficient of determination (R²), Kendall’s tau, and the fraction of best ligands.
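For readers curious how the headline metrics are computed from per-ligand affinities, here is a rough sketch using numpy and scipy. The fraction_of_best helper is a hypothetical, simplified reading of Bayly’s metric (how many of the truly most potent ligands are recovered among the top predictions) and not necessarily the exact definition used in our analysis.

```python
import numpy as np
from scipy import stats

def rmse(calc, expt):
    """Root mean squared error between calculated and experimental values."""
    return np.sqrt(np.mean((np.asarray(calc) - np.asarray(expt)) ** 2))

def fraction_of_best(calc, expt, top_n=3):
    """Hypothetical simplification: the fraction of the top_n most potent
    ligands by experiment that are also ranked in the top_n by calculation."""
    best_expt = set(np.argsort(expt)[:top_n])  # most negative dG = most potent
    best_calc = set(np.argsort(calc)[:top_n])
    return len(best_expt & best_calc) / top_n

# Hypothetical placeholder affinities (kcal/mol) for a single system.
expt = np.array([-10.5, -9.8, -9.1, -8.7, -8.2, -7.9])
calc = np.array([-11.9, -9.5, -9.3, -8.1, -8.6, -7.7])

print("RMSE:", rmse(calc, expt))
print("R^2:", stats.pearsonr(calc, expt)[0] ** 2)
print("Kendall's tau:", stats.kendalltau(calc, expt)[0])
print("Fraction of best (top 3):", fraction_of_best(calc, expt))
```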

When comparing the protocols this way, they start to look a lot more similar. Even so, we still aim to improve the overall accuracy of the protocol going forward, and we hope that the clues for doing so lie in the pile of data we have collected.

But wait...there's more

That's not yet the end of the story, as our pharma partners went above and beyond and also ran private datasets from in-house projects to get a true real-world indication of OpenFE's performance in production. As these datasets are not manually curated (the data was extracted from automated workflows designed to screen hundreds of ideas weekly to keep up with the design-make-test cycle), we expect them to be hard and diverse, highlighting many more issues with our default protocol.

We are proud to say that our partners didn’t hold back on this challenge and doubled the size of the benchmark, running another 37 protein-ligand systems with 864 unique ligands. Seeing calculations run at this scale is already a testament to the reliability of the protocol and to the ease with which calculations can be set up and executed routinely. Now to the overview results!

The private datasets present an increased challenge for the default OpenFE protocol, with a noticeable reduction in accuracy across all metrics. 

There is a noticeable step down in accuracy. Many more outlier predictions with very large errors are present, suggesting that real drug discovery is messy and far from the pristine, hand-crafted benchmark systems often used to assess performance. It’s not all doom and gloom, though: the large errors seem to be system-specific, and even in these cases we still see good ranking. Take, for example, the system from Roche below. Clearly we are overestimating the potency of the best compounds, but we are still able to identify them.

The default OpenFE protocol can identify the most potent ligands despite high absolute error. An example dataset from Roche is shown, where the two most potent compounds are significantly overpredicted by the protocol.

What's next?

Well, we still have a lot more data to untangle, bugs to file and fix, and finally a write-up of this project for a journal before we can consider this done and start … the next round of benchmarks. That's right! We aim to practise what we preach. And anyway, how else will we know whether the findings of this study have improved the accuracy of our Protocol? For those who can’t wait for the full write-up (preprint coming soon!), you can find all of the inputs of the study along with the raw outputs assembled here; feel free to dive in and let us know what you find.

Finally, we should reflect on how far OpenFE has come in just the first three years of the project. From founding and writing the first few lines of code in 2022, to creating a protocol whose performance is competitive with state-of-the-art commercial software, to running a large-scale collaborative benchmark with 15 different pharma partners: that's quite a journey. Moreover, we are just getting started, with new Protocols, improved high-throughput task execution and many other enhancements planned for the coming year. There is a lot to be excited about.

This progress, of course, would not be possible without all those who support our effort, from our pharma partners to our technical advisory committee and the Open Molecular Software Foundation, which hosts us. We are continually thankful!

Remember, always be benchmarking!