Creating Datasets

This example shows how QCSubmit can be used to curate QCFractal-compatible datasets that can be submitted to any QCFractal instance, such as QCArchive.

In particular, it shows how the framework can be used to define reproducible workflows for curating datasets by processing large lists of molecules. The API makes it easy to include operations like filtering, state enumeration, and fragmentation in these workflows. We will also demonstrate how such a workflow can be exported to a settings file that another user can then use to reconstruct the entire workflow.

For the sake of clarity, all verbose warnings will be disabled in this tutorial:

[1]:
# work around some packaging tension
try:
    import qcportal  # noqa
    from openeye import oechem  # noqa
except Exception:  # optional imports; ignore any failure but not KeyboardInterrupt
    pass
[2]:
import logging
import warnings

# silence Python warnings and reduce OpenFF Toolkit logging to errors only
warnings.filterwarnings("ignore")
logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

Creating a dataset factory

The openff-qcsubmit package provides a number of dataset ‘factories’. A factory is a reusable object that encodes all the core settings that will be used to curate and compute a dataset from an input list of molecules.

Here we will begin by creating a ‘basic’ dataset factory:

[3]:
from qcportal.singlepoint import SinglepointDriver

from openff.qcsubmit.common_structures import QCSpec
from openff.qcsubmit.factories import BasicDatasetFactory

factory = BasicDatasetFactory(
    driver=SinglepointDriver.energy,
    qc_specifications={
        "default": QCSpec(),
        "ani1ccx": QCSpec(
            program="torchani",
            method="ani1ccx",
            basis=None,
            spec_name="ani1ccx",
            spec_description="ANI1ccx standard specification",
        ),
    },
)

factory
[3]:
BasicDatasetFactory(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={}), 'ani1ccx': QCSpec(method='ani1ccx', basis=None, program='torchani', spec_name='ani1ccx', spec_description='ANI1ccx standard specification', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}, driver=<SinglepointDriver.energy: 'energy'>, priority='normal', dataset_tags=['openff'], compute_tag='openff', type='BasicDatasetFactory', workflow=[])

This factory is responsible for creating a ‘basic’ dataset that will contain a collection of single point calculations provided through the energy/gradient/hessian drivers. Dataset factories are also available for optimization and torsiondrive datasets.

Here we have specified that datasets created using this factory should be computed using two different ‘quantum chemical’ (QC) specifications:

  • default: the OpenFF default specification, which uses B3LYP-D3BJ/DZVP with psi4.

  • ani1ccx: ANI1ccx provided by the torchani package.

The default settings are those recommended by the OpenFF Consortium and are currently used in the fitting of the OpenFF force fields.

Now, let’s look at the workflow components that will be used to curate our initial set of molecules:

[4]:
factory.workflow
[4]:
[]

workflow is a list of the steps that will be run, stored in the order in which they will be executed. By default it is empty. Each step is called a “component”.
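
The idea of an ordered list of components can be sketched in plain Python. This is only an illustration of the concept and makes no claim about how QCSubmit implements it: `run_workflow`, `no_chlorine`, and `at_least_five_chars` are hypothetical names, and the toy filters operate on SMILES strings rather than real molecule objects.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical component signature: take items, return (kept, removed).
Component = Callable[[List[str]], Tuple[List[str], List[str]]]


def no_chlorine(smiles: List[str]) -> Tuple[List[str], List[str]]:
    """Toy filter: remove any SMILES string that contains 'Cl'."""
    kept = [s for s in smiles if "Cl" not in s]
    removed = [s for s in smiles if "Cl" in s]
    return kept, removed


def at_least_five_chars(smiles: List[str]) -> Tuple[List[str], List[str]]:
    """Toy filter: remove very short SMILES strings (a crude size proxy)."""
    kept = [s for s in smiles if len(s) >= 5]
    removed = [s for s in smiles if len(s) < 5]
    return kept, removed


def run_workflow(
    items: List[str], workflow: List[Component]
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Apply each component in list order, recording what each one removed."""
    removed_by: Dict[str, List[str]] = {}
    for component in workflow:
        items, removed = component(items)
        removed_by[component.__name__] = removed
    return items, removed_by


kept, removed_by = run_workflow(
    ["CC", "CCl", "c1ccccc1"], [no_chlorine, at_least_five_chars]
)
print(kept)
print(removed_by)
```

Because the components run in list order, each one only sees the items that survived the previous step, which is why the order of the workflow matters.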

QCSubmit provides a suite of common curation components, for example to filter out molecules that contain unsupported elements or to generate a set of conformers for each molecule.

Let’s set up a workflow that will filter out molecules containing elements that are not supported by ANI1, then filter by molecular weight, and finally generate conformers for each of the molecules passing through the factory.

First we set up the element filter:

[5]:
from openff.qcsubmit import workflow_components

el_filter = workflow_components.ElementFilter(allowed_elements=[1, 6, 7, 8])

factory.add_workflow_components(el_filter)

This filter can select elements by symbol or atomic number. Here we keep only molecules composed entirely of hydrogen, carbon, nitrogen, and oxygen, as we would like to use ANI1 as our QC method.
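
The logic of such a filter can be sketched without any chemistry toolkit. The following is a minimal illustration, not the ElementFilter implementation: molecules are represented as plain lists of atomic numbers, and `SYMBOL_TO_NUMBER`, `normalise`, and `passes_element_filter` are hypothetical names introduced for this sketch.

```python
from typing import Iterable, List, Set, Union

# Minimal symbol table for this sketch only (not a full periodic table).
SYMBOL_TO_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "Cl": 17}


def normalise(allowed: Iterable[Union[int, str]]) -> Set[int]:
    """Convert a mix of element symbols and atomic numbers to atomic numbers."""
    return {SYMBOL_TO_NUMBER[a] if isinstance(a, str) else a for a in allowed}


def passes_element_filter(
    atomic_numbers: List[int], allowed: Iterable[Union[int, str]]
) -> bool:
    """Return True only if every atom in the molecule is an allowed element."""
    return set(atomic_numbers) <= normalise(allowed)


# Molecules written as lists of atomic numbers (assumed representation).
ethane = [6, 6, 1, 1, 1, 1, 1, 1]
chloromethane = [6, 17, 1, 1, 1]

print(passes_element_filter(ethane, [1, 6, 7, 8]))
print(passes_element_filter(chloromethane, [1, 6, 7, 8]))
print(passes_element_filter(ethane, ["H", "C"]))
```

A single disallowed atom is enough to reject the whole molecule, so the check is a subset test rather than a per-atom deletion.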

Now we set up the weight filter and conformer generation components and add them to the workflow:

[6]:
weight_filter = workflow_components.MolecularWeightFilter(
    minimum_weight=130,
    maximum_weight=781,
)
factory.add_workflow_components(weight_filter)

conf_gen = workflow_components.StandardConformerGenerator(
    max_conformers=1, toolkit="rdkit"
)
factory.add_workflow_components(conf_gen)
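
To make the weight window concrete, here is a rough sketch of the kind of check a molecular weight filter performs. It is an illustration under stated assumptions, not the MolecularWeightFilter implementation: the weights are assumed to be in g/mol, the bounds are assumed to be inclusive, the atomic masses are rounded standard values, and `molecular_weight` and `within_weight_window` are hypothetical helpers.

```python
from typing import Dict

# Rounded standard atomic masses in g/mol (enough for this sketch).
AVERAGE_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition: Dict[str, int]) -> float:
    """Sum atomic masses over an element -> count mapping."""
    return sum(AVERAGE_MASS[element] * n for element, n in composition.items())


def within_weight_window(
    composition: Dict[str, int], minimum: float = 130, maximum: float = 781
) -> bool:
    """Return True if the weight lies inside the (assumed inclusive) window."""
    return minimum <= molecular_weight(composition) <= maximum


ethane = {"C": 2, "H": 6}          # C2H6
c8h6o2 = {"C": 8, "H": 6, "O": 2}  # C8H6O2

for name, comp in [("C2H6", ethane), ("C8H6O2", c8h6o2)]:
    print(name, round(molecular_weight(comp), 3), within_weight_window(comp))
```

With a lower bound of 130, small molecules such as ethane (about 30 g/mol) are rejected, while C8H6O2 (about 134 g/mol) falls inside the window.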

Let’s look at the workflow and make sure all the components were correctly added:

[7]:
factory.workflow
[7]:
[ElementFilter(type='ElementFilter', allowed_elements=[1, 6, 7, 8]),
 MolecularWeightFilter(type='MolecularWeightFilter', minimum_weight=130, maximum_weight=781),
 StandardConformerGenerator(type='StandardConformerGenerator', rms_cutoff=None, max_conformers=1, clear_existing=True)]

We can save the settings and workflow so they can be used again later. Workflows can be saved to several formats, including the popular JSON and YAML:

[8]:
factory.export_settings("example-factory.json")
factory.export_settings("example-factory.yaml")

Let’s look at the JSON output:

[9]:
! head -n 20 example-factory.json
{
  "qc_specifications": {
    "default": {
      "method": "B3LYP-D3BJ",
      "basis": "DZVP",
      "program": "psi4",
      "spec_name": "default",
      "spec_description": "Standard OpenFF optimization quantum chemistry specification.",
      "store_wavefunction": "none",
      "implicit_solvent": null,
      "maxiter": 200,
      "scf_properties": [
        "dipole",
        "quadrupole",
        "wiberg_lowdin_indices",
        "mayer_indices"
      ],
      "keywords": {}
    },
    "ani1ccx": {

These settings can be re-imported easily using the API:

[10]:
imported_factory = BasicDatasetFactory.from_file("example-factory.json")
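
The reason a settings file is enough to rebuild a workflow is that the settings are plain data. The sketch below shows the round trip with the standard `json` module only. The dictionary is a hand-written stand-in whose key names mirror the printed output above; the exact layout of the exported file, including the `workflow` key, is an assumption.

```python
import json

# Hand-written stand-in for exported factory settings (layout is assumed).
settings = {
    "qc_specifications": {
        "default": {"method": "B3LYP-D3BJ", "basis": "DZVP", "program": "psi4"},
    },
    "workflow": [
        {"type": "ElementFilter", "allowed_elements": [1, 6, 7, 8]},
        {
            "type": "MolecularWeightFilter",
            "minimum_weight": 130,
            "maximum_weight": 781,
        },
    ],
}

# Serialise to text, as would be written to a file, then parse it back.
text = json.dumps(settings, indent=2)
restored = json.loads(text)

# The component order survives the round trip because JSON arrays are ordered.
component_types = [component["type"] for component in restored["workflow"]]
print(component_types)
print(restored == settings)
```

Storing a `type` name with each component is what lets a loader decide which class to rebuild for each entry.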

Creating the dataset

We can run the workflow on an example set of molecules:

[11]:
from openff.toolkit.topology import Molecule

mols = [
    Molecule.from_smiles(smiles)
    for smiles in [
        "[H]/N=C(/N)\\Nc1[nH]nnn1",
        "c1cc[nH+]cc1",
        "C[N+](C)(C)[O-]",
        "CONC(=O)N",
        "c1ccc2c(c1)cc[nH]2",
        "c1ccc(cc1)/N=C\\NO",
        "C=CO",
        "c1cocc1[O-]",
        "CC(=O)NO",
        "C[N+](=C)C",
        "C(=O)C=O",
        "C=C",
        "CC1=NC(=NC1=[N+]=[N-])Cl",
        "c1cc[n+](cc1)[O-]",
        "CN(C)O",
        "N(=O)(=O)O",
        "CC=O",
        "c1cc(oc1)c2ccco2",
        "CC",
        "C1C=CC(=O)C=C1",
    ]
]

This is as simple as calling the factory’s create_dataset method and providing the set of molecules as input:

[12]:
dataset = factory.create_dataset(
    molecules=mols,
    dataset_name="example-dataset",
    description="An example dataset.",
    tagline="An example dataset.",
)
dataset

Deduplication                 : 100%|██████████| 20/20 [00:00<00:00, 637.53it/s]
ElementFilter                 : 100%|███████████| 20/20 [00:04<00:00,  4.95it/s]
MolecularWeightFilter         : 100%|███████████| 19/19 [00:03<00:00,  5.57it/s]
StandardConformerGenerator    : 100%|█████████████| 2/2 [00:02<00:00,  1.06s/it]
Preparation                   : 100%|█████████████| 2/2 [00:00<00:00, 56.30it/s]

[12]:
BasicDataset(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={}), 'ani1ccx': QCSpec(method='ani1ccx', basis=None, program='torchani', spec_name='ani1ccx', spec_description='ANI1ccx standard specification', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None, maxiter=200, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], keywords={})}, driver=<SinglepointDriver.energy: 'energy'>, priority='normal', dataset_tags=['openff'], compute_tag='openff', dataset_name='example-dataset', dataset_tagline='An example dataset.', type='DataSet', description='An example dataset.', metadata=Metadata(submitter='mattthompson', creation_date=datetime.date(2024, 1, 30), collection_type='DataSet', dataset_name='example-dataset', short_description='An example dataset.', long_description_url=None, long_description='An example dataset.', elements={'C', 'N', 'H', 'O'}), provenance={'openff-qcsubmit': '0.50.2+0.g2fa465a.dirty', 'openff-toolkit': '0.15.0', 'RDKitToolkitWrapper': '2023.09.4', 'AmberToolsToolkitWrapper': '22.0'}, dataset={'ON/C=N\\c1ccccc1': DatasetEntry(index='ON/C=N\\c1ccccc1', initial_molecules=[Molecule(name='C7H8N2O', formula='C7H8N2O', hash='3c416ab')], attributes=MoleculeAttributes(canonical_smiles='ONC=Nc1ccccc1', canonical_isomeric_smiles='ON/C=N\\c1ccccc1', 
canonical_explicit_hydrogen_smiles='[H][O][N]([H])[C]([H])=[N][c]1[c]([H])[c]([H])[c]([H])[c]([H])[c]1[H]', canonical_isomeric_explicit_hydrogen_smiles='[H][O][N]([H])/[C]([H])=[N]\\[c]1[c]([H])[c]([H])[c]([H])[c]([H])[c]1[H]', canonical_isomeric_explicit_hydrogen_mapped_smiles='[c:1]1([H:11])[c:2]([H:12])[c:3]([H:13])[c:4](/[N:7]=[C:8](\\[N:9]([O:10][H:18])[H:17])[H:16])[c:5]([H:14])[c:6]1[H:15]', molecular_formula='C7H8N2O', standard_inchi='InChI=1S/C7H8N2O/c10-9-6-8-7-4-2-1-3-5-7/h1-6,10H,(H,8,9)', inchi_key='FEUZPLBUEYBLTN-UHFFFAOYSA-N', fixed_hydrogen_inchi='InChI=1/C7H8N2O/c10-9-6-8-7-4-2-1-3-5-7/h1-6,10H,(H,8,9)/f/h9H/b8-6-', fixed_hydrogen_inchi_key='FEUZPLBUEYBLTN-NAFDMULTNA-N', unique_fixed_hydrogen_inchi_keys={'FEUZPLBUEYBLTN-NAFDMULTNA-N'}), extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[c:1]1([H:11])[c:2]([H:12])[c:3]([H:13])[c:4](/[N:7]=[C:8](\\[N:9]([O:10][H:18])[H:17])[H:16])[c:5]([H:14])[c:6]1[H:15]'}, keywords={}), 'c1coc(-c2ccco2)c1': DatasetEntry(index='c1coc(-c2ccco2)c1', initial_molecules=[Molecule(name='C8H6O2', formula='C8H6O2', hash='3dbee98')], attributes=MoleculeAttributes(canonical_smiles='c1coc(-c2ccco2)c1', canonical_isomeric_smiles='c1coc(-c2ccco2)c1', canonical_explicit_hydrogen_smiles='[H][C]1=[C]([H])[C]([H])=[C]([C]2=[C]([H])[C]([H])=[C]([H])[O]2)[O]1', canonical_isomeric_explicit_hydrogen_smiles='[H][C]1=[C]([H])[C]([H])=[C]([C]2=[C]([H])[C]([H])=[C]([H])[O]2)[O]1', canonical_isomeric_explicit_hydrogen_mapped_smiles='[C:1]1([H:11])=[C:5]([H:13])[O:4][C:3]([C:6]2=[C:7]([H:14])[C:8]([H:15])=[C:9]([H:16])[O:10]2)=[C:2]1[H:12]', molecular_formula='C8H6O2', standard_inchi='InChI=1S/C8H6O2/c1-3-7(9-5-1)8-4-2-6-10-8/h1-6H', inchi_key='UDHZFLBMZZVHRA-UHFFFAOYSA-N', fixed_hydrogen_inchi='InChI=1/C8H6O2/c1-3-7(9-5-1)8-4-2-6-10-8/h1-6H', fixed_hydrogen_inchi_key='UDHZFLBMZZVHRA-UHFFFAOYNA-N', unique_fixed_hydrogen_inchi_keys={'UDHZFLBMZZVHRA-UHFFFAOYNA-N'}), extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles': 
'[C:1]1([H:11])=[C:5]([H:13])[O:4][C:3]([C:6]2=[C:7]([H:14])[C:8]([H:15])=[C:9]([H:16])[O:10]2)=[C:2]1[H:12]'}, keywords={})}, filtered_molecules={'ElementFilter': FilterEntry(component='ElementFilter', component_settings={'type': 'ElementFilter', 'allowed_elements': [1, 6, 7, 8]}, component_provenance={'openff-toolkit': '0.15.0', 'openff-qcsubmit': '0.50.2+0.g2fa465a.dirty', 'RDKitToolkitWrapper': '2023.09.4', 'AmberToolsToolkitWrapper': '22.0', 'openff-units_elements': '0.2.1'}, molecules=['[H][C]([H])([H])[C]1=[N][C]([Cl])=[N][C]1=[N+]=[N-]']), 'MolecularWeightFilter': FilterEntry(component='MolecularWeightFilter', component_settings={'type': 'MolecularWeightFilter', 'minimum_weight': 130, 'maximum_weight': 781}, component_provenance={'openff-toolkit': '0.15.0', 'openff-qcsubmit': '0.50.2+0.g2fa465a.dirty', 'RDKitToolkitWrapper': '2023.09.4', 'AmberToolsToolkitWrapper': '22.0'}, molecules=['[H]/[N]=[C](/[N]([H])[H])[N]([H])[C]1=[N][N]=[N][N]1[H]', '[H][c]1[c]([H])[c]([H])[n+]([H])[c]([H])[c]1[H]', '[H][C]([H])([H])[N+]([O-])([C]([H])([H])[H])[C]([H])([H])[H]', '[H][N]([H])[C](=[O])[N]([H])[O][C]([H])([H])[H]', '[H][C]1=[C]([H])[N]([H])[c]2[c]([H])[c]([H])[c]([H])[c]([H])[c]21', '[H][O][C]([H])=[C]([H])[H]', '[H][C]1=[C]([H])[C]([O-])=[C]([H])[O]1', '[H][O][N]([H])[C](=[O])[C]([H])([H])[H]', '[H][C]([H])=[N+]([C]([H])([H])[H])[C]([H])([H])[H]', '[H][C](=[O])[C]([H])=[O]', '[H][C]([H])=[C]([H])[H]', '[H][c]1[c]([H])[c]([H])[n+]([O-])[c]([H])[c]1[H]', '[H][O][N]([C]([H])([H])[H])[C]([H])([H])[H]', '[H][O][N+](=[O])[O-]', '[H][C](=[O])[C]([H])([H])[H]', '[H][C]([H])([H])[C]([H])([H])[H]', '[H][C]1=[C]([H])[C]([H])([H])[C]([H])=[C]([H])[C]1=[O]']), 'StandardConformerGenerator': FilterEntry(component='StandardConformerGenerator', component_settings={'type': 'StandardConformerGenerator', 'rms_cutoff': None, 'max_conformers': 1, 'clear_existing': True}, component_provenance={'openff-toolkit': '0.15.0', 'openff-qcsubmit': '0.50.2+0.g2fa465a.dirty', 'RDKitToolkitWrapper': 
'2023.09.4', 'AmberToolsToolkitWrapper': '22.0'}, molecules=[])})

We can easily see how many molecules the dataset contains after filtering:

[13]:
dataset.n_molecules
[13]:
2

and how many QC ‘records’ will be computed for this dataset:

[14]:
dataset.n_records
[14]:
2

We can iterate over the molecules in the dataset:

[15]:
for molecule in dataset.molecules:
    print(molecule.to_smiles(explicit_hydrogens=False))
ON/C=N\c1ccccc1
c1coc(-c2ccco2)c1

as well as those that were filtered out during its construction:

[16]:
for molecule in dataset.filtered:
    print(molecule.to_smiles(explicit_hydrogens=False))
CC1=NC(Cl)=NC1=[N+]=[N-]
[H]/N=C(/N)Nc1nnn[nH]1
c1cc[nH+]cc1
C[N+](C)(C)[O-]
CONC(N)=O
c1ccc2[nH]ccc2c1
C=CO
[O-]c1ccoc1
CC(=O)NO
C=[N+](C)C
O=CC=O
C=C
[O-][n+]1ccccc1
CN(C)O
O=[N+]([O-])O
CC=O
CC
O=C1C=CCC=C1
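
As a quick sanity check, the counts reported above can be reconciled by hand. The numbers below are copied from the output in this example (20 input molecules, 1 removed by the element filter, 17 removed by the weight filter, none removed during conformer generation); the variable names are only illustrative.

```python
# Counts taken from the output shown in this example.
n_input = 20
removed = {
    "ElementFilter": 1,
    "MolecularWeightFilter": 17,
    "StandardConformerGenerator": 0,
}

n_filtered = sum(removed.values())
n_kept = n_input - n_filtered

print(n_filtered)
print(n_kept)
```

Every input molecule should end up either in the dataset or in exactly one filter's list, so the kept and filtered counts add up to the input size.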

The final dataset is readily exportable to JSON:

[17]:
dataset.export_dataset("example-dataset.json")

and the molecules it contains can be exported to various formats:

[18]:
dataset.molecules_to_file("example-dataset.smi", "smi")
dataset.molecules_to_file("example-dataset.inchi", "inchi")
dataset.molecules_to_file("example-dataset.inchikey", "inchikey")

The molecules contained within a dataset can also be easily visualized by exporting the dataset to a PDF:

[19]:
dataset.visualize("example-dataset.pdf")