Benchmarking Data Efficiency in Advanced Machine Learning Models for Quantum Chemistry

V. Vinod, P. Zaspel; Benchmarking data efficiency in Δ-ML and multifidelity models for quantum chemistry. J. Chem. Phys. 163 (2): 024134, 2025. DOI: 10.1063/5.0272457; also available as arXiv:2410.11391.

We are pleased to announce the publication of “Benchmarking data efficiency in Δ-ML and multifidelity models for quantum chemistry” in The Journal of Chemical Physics. This work, co-authored by Vivin Vinod and Peter Zaspel, examines a critical bottleneck in machine learning (ML) for computational quantum chemistry: the high cost of generating training data for ML models.

The development of ML methods has made quantum chemistry (QC) calculations considerably more accessible by lowering their computational expense. However, this has shifted attention to the cost of generating the training data. Our study benchmarks time-cost against model accuracy for several state-of-the-art multifidelity ML approaches, and also contributes a new method, Multifidelity Δ-Machine Learning (MFΔML).

Key Contributions of the Research:

Comprehensive Benchmarking: The study rigorously compares the data costs of several advanced ML methods: Δ-ML, multifidelity machine learning (MFML), optimized MFML (o-MFML), and a newly introduced method, Multifidelity Δ-Machine Learning (MFΔML). This assessment is based on the cost of generating training data for each model, directly contrasted with the single-fidelity kernel ridge regression approach.
Leveraging the QeMFi Dataset: For a uniform and robust assessment, the research uses the QeMFi dataset, which comprises 135,000 geometries of nine chemically diverse molecules. For each geometry, QC properties are available at five fidelities, computed with the time-dependent density functional theory (TD-DFT) formalism using the STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP basis sets.
Predictive Capabilities: The models were evaluated on their ability to predict essential QC properties, including ground state energies, first and second vertical excitation energies, and the magnitude of the electronic contribution to the molecular dipole moment.
Optimizing for Different Prediction Scenarios: The results indicate that multifidelity methods generally outperform standard Δ-ML approaches when a large number of predictions are required. Furthermore, the newly developed MFΔML method offers a distinct advantage over conventional Δ-ML in applications where only a limited number of predictions or evaluations are needed.
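
To make the difference between single-fidelity kernel ridge regression and Δ-ML concrete, the following is a minimal sketch in plain NumPy. It is not the code used in the paper and does not use the QeMFi dataset: the functions low_fidelity and high_fidelity are hypothetical one-dimensional stand-ins for a cheap and an expensive QC calculation, and the kernel width and regularization values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Solve (K + lam * I) alpha = y for the KRR coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_query, sigma=1.0):
    """Evaluate the fitted KRR model at the query points."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha

# Hypothetical toy "fidelities" (assumptions, not real QC methods).
def low_fidelity(X):
    return np.sin(3.0 * X[:, 0])

def high_fidelity(X):
    return np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 0]**2

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(20, 1))
X_test = np.linspace(-1.0, 1.0, 50)[:, None]
y_test = high_fidelity(X_test)

# Single-fidelity KRR: learn the expensive property directly.
alpha_sf = krr_fit(X_train, high_fidelity(X_train))
pred_sf = krr_predict(X_train, alpha_sf, X_test)

# Delta-ML: learn only the difference between the two fidelities,
# then add the cheap baseline back at prediction time.
delta_train = high_fidelity(X_train) - low_fidelity(X_train)
alpha_delta = krr_fit(X_train, delta_train)
pred_delta = low_fidelity(X_test) + krr_predict(X_train, alpha_delta, X_test)

mae_sf = np.mean(np.abs(pred_sf - y_test))
mae_delta = np.mean(np.abs(pred_delta - y_test))
print("single-fidelity KRR MAE:", mae_sf)
print("Delta-ML MAE:", mae_delta)
```

Note that the Δ-ML prediction calls low_fidelity on every query point. In a real setting this is a QC calculation per prediction, so the total cost of Δ-ML grows with the number of predictions, which is why the number of required predictions matters when comparing methods.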

This research helps guide the choice of ML method for quantum chemistry applications and supports the development of more efficient and cost-effective computational pipelines. It offers practical insight for researchers in materials science and chemistry who want to reduce the computational burden of high-accuracy calculations.