A new database of inorganic materials is available on the Materials Cloud
By Nicola Nosengo/NCCR MARVEL
Reliable and reproducible materials data are a cornerstone of modern computational materials science. They enable researchers to compare theoretical predictions across large families of materials and provide valuable resources for data-driven approaches, including machine-learning models used to discover new materials. However, many existing computational materials databases rely on different density functional theory (DFT) settings and computational protocols, which introduces inconsistencies across datasets.
To address this challenge, a team of NCCR MARVEL scientists, led by researchers at EPFL and PSI, has introduced the Materials Cloud Three-Dimensional Structure Database (MC3D), a systematically curated database of quantum-mechanical calculations for inorganic materials derived from experimental crystal structures. The database contains more than 32 000 structures whose relaxed geometries and electronic structures were computed with carefully standardized DFT workflows, using three different functionals and/or computational protocols.
Diagram of the pipeline that filtered the 901 210 CIF files imported from the COD, ICSD, and MPDS databases down to the MC3D-source collection of 72 589 unique stoichiometric inorganic crystal structures. Red branches indicate structures that were discarded, while green branches indicate structures that passed to the next filtering step.
The work is described in a recent article published in the journal Digital Discovery. To build the database, the researchers started from almost a million crystal structures reported in three major crystallographic repositories: the Crystallography Open Database (COD), the Inorganic Crystal Structure Database (ICSD) and the Materials Platform for Data Science (MPDS). These structures were then passed through a series of automated filtering and validation steps to identify well-defined inorganic compounds suitable for large-scale electronic-structure calculations.
From this starting set, the team generated a curated database of 32 013 materials whose atomic structures were optimized using DFT. The calculations used refined computational protocols and automated workflows built with the open-source workflow engine AiiDA, and were run with the electronic-structure code Quantum ESPRESSO, powered by the SIRIUS library. The workflows were executed on supercomputers at the Swiss National Supercomputing Centre (CSCS), fully exploiting the ALPS research infrastructure and enabling the systematic optimization of thousands of structures with consistent computational settings.
A key feature of MC3D is that the entire workflow used to generate the data, from importing crystal structures to running the DFT calculations, is fully reproducible using an open software stack. This makes the database not only a valuable source of computed materials properties, but also a transparent and reusable framework for generating large-scale materials datasets.
“One of the main goals of MC3D was to make the entire process fully reproducible,” explains Sebastiaan P. Huber, first author of the study. “From the original crystal structures to the final electronic-structure calculations, every step can be repeated using open tools and workflows.”
Beyond providing a consistent reference dataset for computational materials science, MC3D also supports emerging data-driven approaches. For example, it served as a starting point for the MAD dataset used by Michele Ceriotti’s group at EPFL to train the PET-MAD machine-learning interatomic potential.
“Curated and standardized datasets like MC3D are essential for developing reliable machine-learning models in materials science,” says Giovanni Pizzi, corresponding author of the study. “They provide high-quality training data while ensuring full transparency and reproducibility of how the data were generated.”
The MC3D database is openly available through the Materials Cloud Archive, together with a dedicated web application that allows users to explore the structures and calculated properties interactively.
Looking ahead, the researchers see MC3D as a foundation for further developments in data-driven materials discovery. “MC3D provides a carefully curated starting point for a wide range of ongoing research in materials discovery,” says Nicola Marzari, director of NCCR MARVEL and co-author of the study. “We are already extending it with additional electronic properties and using it to explore applications ranging from superconductors to thermoelectrics and battery materials.”