PLUMB blab #2: Which Data Should a Useful Protein-Ligand Benchmark Dataset Include?
Written by: Ariana Brenner Clerkin & Aakash Davasam
The Objective: Pick PLUMB Data
To build PLUMB, our protein-ligand benchmark dataset, we need protein-ligand structural data, binding affinity measurements, and detailed data annotations. Choosing the right data to ingest is crucial for PLUMB’s success. Below, we break down the types of data under consideration, their strengths and weaknesses, and how they may contribute to our goal of building a comprehensive, consistently curated, and reproducible database integrating protein-ligand complex structures with affinity data suitable for assessing force fields and free energy calculations.
Structural + Binding Affinity Datasets: The Core of PLUMB
PLUMB is designed to benchmark methods that rely on 3D structural information to predict binding affinity data. Gold-standard data links crystal structures with a clearly defined ligand binding mode directly to experimentally measured binding affinities for the same ligand-receptor pair. But we are also interested in congeneric series where we have affinity data for multiple ligands in the series, even if we only have structural data for one ligand—we can dock the remaining ligands and be relatively confident the binding mode is preserved. While we plan to perform quality control assessments to verify this hypothesis, this approach allows us to greatly expand on the limited number (~15,000) of protein-small molecule X-ray structures in the PDB with available affinity data. This will enable us to develop a much larger, more useful dataset comprising paired structural models and reliable affinity data grouped into many congeneric series.
Here, we survey a number of data sources that we will consider in constructing PLUMB. Only datasets updated within the past two years are considered. Although few datasets provide protein-ligand structural and/or affinity data at scale, the following stand out:
BindingDB [1]
Overview: A large database of binding affinities extracted from journal articles, patents, ChEMBL (filtered to only well-defined protein targets), and other sources (e.g., the Psychoactive Drug Screening Program (PDSP) [2], the Drug Design Data Resource (D3R), and the Community Structure Activity Resource (CSAR) [3]).
- Scale: As of February 2025, ~2.9 million protein-ligand affinity measurements (2.4 million from human targets), covering 9,300 target proteins [4].
- Strengths:
- Rich metadata annotations, including assay type annotation
- Continuously supported and updated since 2001 [5].
- Backed up quarterly to UC San Diego Library Research Collections (Chronopolis) to ensure long-term availability.
- Draws from patents; this is crucial because these compounds are more potent and more synthetically accessible on average than compounds drawn from journal articles, although they also seem to have higher molecular weights [1].
- 1,200 pre-defined congeneric series with at least one related X-ray structure available in the PDB in the “Protein-Ligand Validation Set” high-quality subset. This subset was generated by identifying every PDB protein-ligand crystal structure that matched a protein-ligand affinity measurement in BindingDB. The BindingDB curators then identified all other compounds in BindingDB for that protein that are at least 60% similar to the crystal ligand (defined as having a maximum common substructure with the crystal ligand that covers ≥0.6 of the non-hydrogen atoms and also having a Tanimoto similarity of ≥0.6 with the crystal ligand). Series consisting of ≤5 ligands were discarded. Series for the same protein that shared ligands were combined.
- Limitations:
- No direct 3D protein structures; must parse via PDB IDs.
- PDB structures may not perfectly match experimental proteins—some protein-ligand affinities link to PDB IDs with only 85% sequence identity [4]. However, there is a way to filter to only those proteins with 100% sequence match [4].
- It is not denoted which ligand in the congeneric series is the crystallographic ligand, and maximum common substructure information is unavailable.
- Potential data duplication, because the same protein-ligand affinity measurement may be reported across multiple sources (e.g., in both primary literature and review articles).
- Data are derived from multiple sources where experimental conditions vary.
- Patent data is highly variable in presentation. For example, affinities may appear “binned” into affinity categories to avoid disclosing too much data to competitors.
- PLUMB plan:
- Include! BindingDB will be the backbone of PLUMB. We will start to build PLUMB using BindingDB’s “Protein-Ligand Validation Set”—1,200 pre-defined congeneric series supported by available structural data. Structures will be parsed from RCSB Protein Data Bank.
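The series-construction rules quoted above (both similarity thresholds, discarding small series, and merging series that share ligands) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BindingDB's actual code: the similarity values are assumed to be precomputed with a cheminformatics toolkit, the crystal ligand is assumed to count as a member of its own series, and all function names and the tuple layout are hypothetical.

```python
from collections import defaultdict

MIN_SIMILARITY = 0.6   # threshold quoted for both MCS fraction and Tanimoto
MIN_SERIES_SIZE = 6    # series with <=5 ligands are discarded


def is_series_member(mcs_fraction, tanimoto):
    """Both similarity criteria must hold relative to the crystal ligand."""
    return mcs_fraction >= MIN_SIMILARITY and tanimoto >= MIN_SIMILARITY


def merge_overlapping(series_list):
    """Merge ligand sets (for one protein) that share at least one ligand."""
    merged = []
    for series in series_list:
        series = set(series)
        remaining = []
        for existing in merged:
            if existing & series:
                series |= existing      # absorb the overlapping series
            else:
                remaining.append(existing)
        remaining.append(series)
        merged = remaining
    return merged


def build_series(candidates):
    """Group ligands into congeneric series.

    candidates: iterable of (protein, crystal_ligand, ligand,
    mcs_fraction, tanimoto) tuples with precomputed similarities
    (hypothetical layout).
    """
    per_crystal = defaultdict(set)
    for protein, crystal, ligand, mcs, tani in candidates:
        if is_series_member(mcs, tani):
            per_crystal[(protein, crystal)].add(ligand)

    per_protein = defaultdict(list)
    for (protein, crystal), ligands in per_crystal.items():
        ligands = ligands | {crystal}   # assumption: crystal ligand is a member
        if len(ligands) >= MIN_SERIES_SIZE:
            per_protein[protein].append(ligands)

    return {p: merge_overlapping(s) for p, s in per_protein.items()}
```

Small series are discarded before merging here because that is the order in which the steps are described; whether the original pipeline applies them in that order is an assumption.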
RCSB Protein Data Bank (PDB) [6]
Overview: The global archive of 3D biomolecular structures, including many protein-ligand complexes.
- Scale:
- As of April 2025, 14,814 protein-ligand structures with affinities.
- Strengths:
- Essential source of structural data.
- Provides rich metadata, including experimental methods, resolution, protein classifications, and mappings to other databases (such as Uniprot). This is essential for PLUMB’s data annotations.
- Flexible API supports automated metadata parsing.
- Includes biological assembly data, enabling inclusion of systems where the ligand binds at multimer interfaces.
- Limitations:
- Binding affinity data are inconsistently/sparsely available. Some of the binding affinity data links to BindingDB, but these links are not always updated/maintained.
- Ligand relevance varies; some ligands are crystallization artifacts, co-factors, or co-solvents rather than biologically relevant binders.
- The PDB is a superset of all structural data, so target relevance may vary.
- PLUMB plan:
- Include! Retrieve structural data from PDB that is referenced in BindingDB.
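One simple way to handle the ligand-relevance limitation noted above is to screen the ligand identifiers in each structure against an exclusion list of common additives. The sketch below assumes the three-letter PDB chemical component IDs have already been extracted from the structure file. The exclusion list is illustrative and far from exhaustive, the function name is hypothetical, and co-factors would need separate handling.

```python
# Illustrative, NOT exhaustive: PDB chemical component IDs of water and
# common crystallization additives / buffer components.
COMMON_ADDITIVES = frozenset(
    {"HOH", "GOL", "EDO", "PEG", "SO4", "PO4", "DMS", "ACT"}
)


def candidate_ligands(het_codes, extra_exclusions=()):
    """Return unique ligand codes that are not on the exclusion list.

    het_codes: iterable of chemical component IDs found in one structure.
    extra_exclusions: optional additional codes to drop (hypothetical hook
    for project-specific lists).
    """
    excluded = COMMON_ADDITIVES | {c.upper() for c in extra_exclusions}
    kept = []
    for code in het_codes:
        code = code.strip().upper()
        if code and code not in excluded and code not in kept:
            kept.append(code)   # preserve first-seen order
    return kept
```

A list-based filter like this only removes the obvious cases; a ligand that matches a BindingDB affinity record is stronger evidence of relevance than its absence from an exclusion list.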
PDBbind & PDBbind+ [7]
Overview: Curated subset of the PDB linking structures to experimentally measured binding affinities from over 50,000 peer-reviewed publications. PDBbind+ is an updated commercial version.
- Scale:
- Free version (2020): 19,443 complexes with affinity data
- PDBbind+ (2025): 27,385 complexes with affinity data; refined subsets available that filter out complexes with various structural or chemical issues.
- Strengths:
- Direct structure-affinity links.
- Structures are pre-processed for docking (reconstructed missing loops/side chains)
- Limitations:
- PDBbind+ requires a paid license.
- No open-source code; curation and pre-processing methods unclear.
- No congeneric series annotation.
- Smaller scale than BindingDB (~30K complexes, compared to potential ~2.9M in BindingDB)
- Reported errors processing some of the data with RDKit [8-10].
- PLUMB plan:
- Exclude. Licensing requirements and unclear pre-processing workflow limit usability.
Binding Mother of All Databases (MOAD) [11]
Overview: Database of crystallographic protein-ligand complexes with affinity data.
- Scale:
- 41,409 protein-ligand structures total; 15,223 with affinities
- Strengths:
- Reliable structural and binding data.
- Filters for biologically-relevant ligands.
- 2D and 3D ligand similarity and protein similarity metrics, which are useful for virtual screening [11].
- Limitations:
- No longer actively maintained (final update: 2023; website discontinued June 2024)
- Each structure is manually inspected before inclusion in MOAD. This process is incompatible with PLUMB’s automatic-curation goal.
- Although some of the affinity data appears to be archived, the majority of the data was licensed to Chemical Abstracts Service (CAS).
- PLUMB plan:
- Include for validating PLUMB data curation. Use archived MOAD affinity data to validate PLUMB data ingestion from BindingDB. Where affinities match between PLUMB and MOAD, add a high-confidence annotation.
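The cross-check described in this plan could look roughly like the following. This is a minimal sketch, assuming both sources have been reduced to dictionaries keyed by (PDB ID, ligand ID) with affinities in nM, and that only measurements of the same type (e.g., Ki with Ki) are compared. The 0.3 log-unit tolerance, the label strings, and the function names are placeholders rather than decisions PLUMB has made.

```python
import math


def to_pk(value_nm):
    """Convert an affinity in nM to -log10 of the molar value."""
    return 9.0 - math.log10(value_nm)


def annotate_confidence(plumb, moad, tolerance=0.3):
    """Label each PLUMB affinity by its agreement with Binding MOAD.

    plumb, moad: dicts mapping (pdb_id, ligand_id) -> affinity in nM.
    tolerance: maximum allowed difference in log units (assumed value).
    """
    annotations = {}
    for key, plumb_nm in plumb.items():
        moad_nm = moad.get(key)
        if moad_nm is None:
            annotations[key] = "unverified"        # no MOAD record to compare
        elif abs(to_pk(plumb_nm) - to_pk(moad_nm)) <= tolerance:
            annotations[key] = "high-confidence"   # the two sources agree
        else:
            annotations[key] = "conflict"          # flag for closer inspection
    return annotations
```

Comparing on a log scale means the tolerance is a fold-difference, so the same threshold applies to picomolar and micromolar binders.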
PLINDER [12]
Overview: Large-scale dataset for AI-driven drug discovery. Aggregates structural and annotation data from PDB.
- Scale:
- >400,000 complexes with curated metadata; 78,410 with paired affinities.
- Strengths:
- Extensive documentation.
- Rich annotation (~500 features); including ligand similarity and protein similarity metrics, to enable leakage-minimized train-test data splits.
- Annotation for congeneric series (“Congeneric series IDs”)
- Continuous integration of new data.
- RDKit-compatible ligands [10].
- Limitations:
- Affinity data is not consistently available across protein-ligand pairs, and a documented bug says the “parsed affinity values are incorrect and should not be used” in their current state.
- Ligand annotations are limited (e.g., metals, co-factors, and sugars are not clearly distinguished from more drug-like ligands)
- Hard to know if the data will remain freely accessible or consistently maintained, given that PLINDER is the result of an industry collaboration.
- Designed for training structure-based machine learning models. Emphasis on data size over data quality filtering.
- PLINDER lacks ligand diversity, with 15 ligands comprising 40% of PLINDER systems. This limits the generalizability of models trained on this data. Furthermore, these 15 ligands are almost exclusively non-druglike. They are made up of sugars (which may be covalent modifications rather than ligands), metal ions, nucleotide derivatives (e.g. ADP, GTP) and coenzymes (e.g. NAP, NAD, FAD).
- Numerous PLINDER systems have missing loops and/or contain non-canonical amino acids that are not annotated. This prevents PLINDER systems from being plug-and-play with commonly used force fields without additional pre-processing.
- PLUMB Plan:
- Monitor. Promising, but PLUMB integration awaits correction of affinity data parsing issues.
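The ligand-diversity concern above comes down to one statistic: the share of systems accounted for by the most frequent ligands. A rough sketch of that calculation follows, assuming one ligand identifier per system has already been extracted; the function name is hypothetical and this is not part of PLINDER's tooling.

```python
from collections import Counter


def top_ligand_share(ligand_ids, top_n=15):
    """Fraction of systems whose ligand is among the top_n most common."""
    counts = Counter(ligand_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total
```

The same check could be applied to PLUMB as it grows, to confirm that a handful of co-factors or ions do not dominate the benchmark.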
BioLiP2 [13]
Overview: A semi-manually curated PDB-derived database of biologically relevant complexes. Annotations include binding affinity data (sourced from BindingDB, Binding MOAD, PDBbind, and literature), along with functional information (e.g. gene ontology, catalytic residues).
- Scale:
- 942,759 total protein-ligand entries; 37,053 with binding affinity data (with redundant entries because each chain of a biological assembly is represented by its own entry).
- Strengths:
- Provides ligand binding sites.
- Filters out non-biologically relevant ligands found in structures using a partially automated and partially manual procedure.
- Continuously integrates new data.
- Limitations:
- Data entries are repetitive for the purposes of PLUMB. For example, PDB ID 5Y6P is represented by 2,780 receptor-ligand interaction entries. This PDB ID contains the structure of a red alga’s phycobilisome: a large protein complex that harvests light for photosynthesis, comprising 862 protein subunits [14]. The complex is made up of only 25 distinct protein components, so BioLiP2 contains a separate entry each time a ligand interacts with a different copy of the same protein subunit.
- Affinity data is largely redundant with other sources. The affinity data is aggregated from other secondary sources: BindingDB, Binding MOAD, PDBbind-CN.
- PLUMB plan:
- Include. Despite the redundancy of the affinity data, BioLiP2’s rich annotations provide valuable additions to PLUMB. In particular, the functional annotations will allow PLUMB users to subset by protein class, and the binding residue annotations may be useful for docking.
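Because BioLiP2 stores one entry per chain, ingesting it into PLUMB would likely involve collapsing entries that describe the same protein-ligand pair within a structure. The sketch below shows one way to do that. The dictionary keys ('pdb_id', 'chain', 'uniprot', 'ligand_id') are hypothetical stand-ins and not BioLiP2's actual column names, and using the UniProt accession as the protein identity is an assumption.

```python
def collapse_redundant_entries(entries):
    """Collapse per-chain entries into one record per protein-ligand pair.

    entries: iterable of dicts with the hypothetical keys
    'pdb_id', 'chain', 'uniprot', and 'ligand_id'.
    Returns a list of records, each listing the chains it was seen on.
    """
    collapsed = {}
    for entry in entries:
        key = (entry["pdb_id"], entry["uniprot"], entry["ligand_id"])
        record = collapsed.setdefault(
            key,
            {
                "pdb_id": entry["pdb_id"],
                "uniprot": entry["uniprot"],
                "ligand_id": entry["ligand_id"],
                "chains": [],
            },
        )
        if entry["chain"] not in record["chains"]:
            record["chains"].append(entry["chain"])
    return list(collapsed.values())
```

Keeping the list of chains, rather than dropping the duplicates outright, preserves the information needed to locate each binding site later.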
Summary Table of Key Dataset Characteristics:
✅Yes ❌No ⚠️Sometimes, Maybe, Partially 🔗Linked, not explicitly included
Final Thoughts/TLDR
A well-balanced benchmark set must represent the diversity of chemical and biological space while maintaining data quality. PLUMB aims to do just that, starting with a robust foundation in BindingDB, anchored by structural data from the PDB, cross-validated against Binding MOAD, and augmented with annotations from BioLiP2.
Are there any datasets we missed that you think may be pivotal to PLUMB? Or additional strategies we should consider? Please let us know! Of course, the plan may change as we learn more about these data sources by using them. We may hit unexpected obstacles as we continue to ingest and process the data. In fact, the next PLUMB blab will document some early hurdles and how we plan to face them.
Sources
[1] T. Liu et al., “BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data,” Nucleic Acids Res., vol. 53, no. D1, pp. D1633–D1644, Jan. 2025, doi: 10.1093/nar/gkae1075.
[2] J. Besnard et al., “Automated design of ligands to polypharmacological profiles,” Nature, vol. 492, no. 7428, pp. 215–220, Dec. 2012, doi: 10.1038/nature11691.
[3] J. B. Jr. Dunbar et al., “CSAR Data Set Release 2012: Ligands, Affinities, Complexes, and Docking Decoys,” J. Chem. Inf. Model., vol. 53, no. 8, pp. 1842–1852, Aug. 2013, doi: 10.1021/ci4000486.
[4] M. Gilson, BindingDB: A Massive, Publicly Accessible, Knowledgebase of Protein-Ligand Binding Data, (Feb. 27, 2025). Accessed: Apr. 25, 2025. [Online Video]. Available: https://www.bindingdb.org/rwd/bind/index.jsp
[5] X. Chen, M. Liu, and M. Gilson, “BindingDB: A Web-Accessible Molecular Recognition Database,” Comb. Chem. High Throughput Screen., vol. 4, no. 8, pp. 719–725, Dec. 2001, doi: 10.2174/1386207013330670.
[6] H. M. Berman, “The Protein Data Bank,” Nucleic Acids Res., vol. 28, no. 1, pp. 235–242, Jan. 2000, doi: 10.1093/nar/28.1.235.
[7] Z. Liu et al., “PDB-wide collection of binding data: current status of the PDBbind database,” Bioinformatics, vol. 31, no. 3, pp. 405–412, Feb. 2015, doi: 10.1093/bioinformatics/btu626.
[8] P. Bryant, A. Kelkar, A. Guljas, C. Clementi, and F. Noé, “Structure prediction of protein-ligand complexes from sequence information with Umol,” Nat. Commun., vol. 15, no. 1, p. 4536, May 2024, doi: 10.1038/s41467-024-48837-6.
[9] H. Stärk, O. Ganea, L. Pattanaik, R. Barzilay, and T. Jaakkola, “EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction,” in Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., in Proceedings of Machine Learning Research, vol. 162. PMLR, Jul. 2022, pp. 20503–20521. [Online]. Available: https://proceedings.mlr.press/v162/stark22b.html
[10] V. Oleinikovas, PLINDER: The protein-ligand interactions dataset and resource, (Oct. 24, 2024). [Online Video]. Available: https://www.youtube.com/watch?v=7-auGX9Z9Nw
[11] S. Wagle, R. D. Smith, A. J. Dominic, D. DasGupta, S. K. Tripathi, and H. A. Carlson, “Sunsetting Binding MOAD with its last data update and the addition of 3D-ligand polypharmacology tools,” Sci. Rep., vol. 13, no. 1, p. 3008, Feb. 2023, doi: 10.1038/s41598-023-29996-w.
[12] J. Durairaj et al., “PLINDER: The protein-ligand interactions dataset and evaluation resource,” Jul. 17, 2024, Biochemistry. doi: 10.1101/2024.07.17.603955.
[13] C. Zhang, X. Zhang, L. Freddolino, and Y. Zhang, “BioLiP2: an updated structure database for biologically relevant ligand–protein interactions,” Nucleic Acids Res., vol. 52, no. D1, pp. D404–D412, Jan. 2024, doi: 10.1093/nar/gkad630.
[14] J. Zhang et al., “Structure of phycobilisome from the red alga Griffithsia pacifica,” Nature, vol. 551, no. 7678, pp. 57–63, Nov. 2017, doi: 10.1038/nature24278.