
How to Access the ASAP Discovery x PolarisHub x OpenADMET Challenge Data

Written by: Hugo MacDermott-Opeskin, PhD


The ASAP Discovery Consortium, Polaris Hub, and OpenADMET recently concluded a joint blind predictive modeling challenge of computational methods for drug discovery. You can see the results here, with some exciting performance from a variety of competitors!

Since the challenge concluded we—the challenge organisers—have received numerous inquiries as to how to obtain the data that was used in the challenge for further benchmarking and follow-up studies. While ASAP Discovery is in the process of depositing all data to open FAIR repositories such as the Protein Data Bank (PDB) and ChEMBL, these depositions are still ongoing. In the meantime, to empower the community we thought we would put together a short blog post detailing how to access the data from Polaris Hub directly. 

You can get the data from Polaris Hub using the following code for each of the challenges. Note that the Set column in the resulting Pandas DataFrames will tell you whether a row was in the train or test split for the challenge. We also have a Zenodo dataset containing all of the raw data, in case that is easier to use in a more general workflow: https://doi.org/10.5281/zenodo.1558206 (you can use this DOI to cite the dataset as a whole).
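As a quick illustration of how the Set column can be used to recover the challenge split, here is a minimal sketch on a toy DataFrame (the column layout and the "train"/"test" labels are assumptions for illustration; inspect the values in the DataFrame you actually download):

```python
import pandas as pd

# Toy DataFrame standing in for a downloaded challenge dataset.
# The "Set" values ("train"/"test") are assumed here; check
# df["Set"].unique() on the real data to confirm.
df = pd.DataFrame({
    "CXSMILES": ["C", "CC", "CCC", "CCCC"],
    "Set": ["train", "train", "test", "test"],
})

# Recover the official challenge split from the Set column
train_df = df[df["Set"] == "train"]
test_df = df[df["Set"] == "test"]

print(len(train_df), len(test_df))  # 2 2
```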

Polaris provides all of the challenge data (and other datasets) in an ML-ready format. This makes it easy to get straight into training your models without having to parse specialised formats or interpret details from a previous paper or benchmark. Read more about how the Polaris team designed their platform with best practices built in here. Additionally, for the competition we made all of our evaluation logic publicly available on GitHub here, so that others can reproduce the evaluation in full.

Ligand poses dataset

We will walk through a more detailed example for the ligand-poses dataset to demonstrate the power of the Polaris API.

Participants were tasked with predicting the bound Ligand Pose, given the CXSMILES, Chain A Sequence, Chain B Sequence and Protein Label. 

import polaris as po
import pandas as pd

# load the dataset from the Hub
dataset = po.load_dataset("asap-discovery/antiviral-ligand-poses-2025-unblinded")

# Get information on the dataset size
dataset.size()

# Load a datapoint in memory
dataset.get_data(
    row=dataset.rows[0],
    col=dataset.columns[0],
)

# convert the whole dataset to a dataframe (may take a while for download)
df = pd.DataFrame(dataset[:])

import datamol as dm
import fastpdb

# write the row 0 ligand to an SDF file
mol = df["Ligand Pose"][0]
dm.to_sdf(mol, "/path/to/mol.sdf")

# write the row 0 complex structure to a PDB file
atom_array = dataset[0]["Complex Structure"]
out_file = fastpdb.PDBFile()
out_file.set_structure(atom_array)
out_file.write("/path/to/another_file.pdb")

The crystallography experiments for this sub-challenge were performed by the University of Oxford and Diamond Light Source. See here and here for the crystallography conditions.

Potency dataset

import polaris as po
import pandas as pd
# load the dataset from the Hub
dataset = po.load_dataset("asap-discovery/antiviral-potency-2025-unblinded")
# convert to dataframe
df = pd.DataFrame(dataset[:])

The potency dataset is slightly easier to work with, as the target values are numeric (pIC50s). Remember that the resulting multi-task matrix is sparse, meaning that not every row will have a value for both pIC50 (SARS-CoV-2-Mpro) and pIC50 (MERS-CoV-Mpro). You may need to mask NaN values when building your models, but this is left up to you. The assays for this sub-challenge were performed by the Weizmann Institute of Science. See here and here for the experimental conditions.
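One simple way to handle the sparsity is to train a separate model per endpoint, dropping the rows where that endpoint was not measured. A minimal sketch on a toy frame (the values are invented for illustration; the column names follow the text above):

```python
import numpy as np
import pandas as pd

# Toy sparse multi-task matrix; values are invented for illustration
df = pd.DataFrame({
    "pIC50 (SARS-CoV-2-Mpro)": [5.1, np.nan, 6.3],
    "pIC50 (MERS-CoV-Mpro)": [np.nan, 4.8, 6.0],
})

# Per-endpoint masking: keep only rows with a measured value,
# then fit a single-task model on each `observed` series
for endpoint in df.columns:
    observed = df[endpoint].dropna()
    print(endpoint, len(observed))
```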

ADMET dataset

import polaris as po
import pandas as pd
# load the dataset from the Hub
dataset = po.load_dataset("asap-discovery/antiviral-admet-2025-unblinded")
# convert to dataframe
df = pd.DataFrame(dataset[:])

Similar to the potency dataset, the multi-task matrix is sparse across the five ADMET endpoints (HLM, MLM, KSOL, Permeability, and LogD). The assays for this sub-challenge were performed by Bienta, with protocols for microsomal stability, KSOL, Permeability, and LogD available at each respective link.
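Before modelling, it can be useful to check how many measured values each endpoint actually has. A minimal sketch on a toy frame (the endpoint names follow the text above; the real Hub column names may differ, and the values are invented):

```python
import numpy as np
import pandas as pd

# Toy sparse ADMET matrix; values are invented for illustration
df = pd.DataFrame({
    "HLM": [10.0, np.nan, 30.0],
    "MLM": [np.nan, 20.0, np.nan],
    "KSOL": [100.0, 150.0, np.nan],
    "Permeability": [np.nan, np.nan, 5.0],
    "LogD": [1.2, np.nan, 2.3],
})

# Count measured (non-NaN) values per endpoint
coverage = df.notna().sum()
print(coverage.to_dict())
```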

Good benchmarks are everything

The above will allow you to use the data exactly as it was used in the challenge. This kind of reproducibility highlights the power of platforms like Polaris, which enables living, single-source-of-truth benchmarks from which performant models and architectures can be unearthed over time. Programmatic reproducibility removes the influence of factors such as dataset handling and splitting on benchmark outcomes. We have recently certified the challenge datasets on Polaris Hub, as they meet the following criteria:

1) they are relevant to drug discovery,
2) they come from a single source (ASAP), and
3) they do not contain obvious errors or ambiguous data.

Polaris provides more detail at the following link.

We hope that this short blog post gives participants the tools they need to continue engaging with the challenge data in advance of the publication of participant papers and a summary paper in the Journal of Chemical Information and Modeling.