Creating Datasets

In this example we will show how openff-qcsubmit can be used to curate QCFractal compatible datasets that can be submitted to public or local fractals instances (e.g. the QCArchive).

In particular, we will show how the framework can be used to define reproducible workflows for curating datasets by processing large lists of molecules and applying filtering and other useful operations such as state enumeration and fragmentation, via the API. Further, we will demonstrate how such a workflow can be exported to a settings file that can then be used to reconstruct the entire workflow by another user.

For the sake of clarity all verbose warnings will be disabled in this tutorial:

[1]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("openff.toolkit").setLevel(logging.ERROR)

Creating a dataset factory

The openff-qcsubmit package provides a number of dataset ‘factories’. A factory is a reusable object that encodes all the core settings that will be used to curate / compute a dataset from an input list of molecule.

Here we will begin by creating a ‘basic’ data set factory:

[2]:
from qcportal.models.common_models import DriverEnum

from openff.qcsubmit.factories import BasicDatasetFactory
from openff.qcsubmit.common_structures import QCSpec

factory = BasicDatasetFactory(
    driver=DriverEnum.energy,
    qc_specifications={
        "default": QCSpec(),
        "ani1ccx": QCSpec(
            program="torchani",
            method="ani1ccx",
            basis=None,
            spec_name="ani1ccx",
            spec_description="ANI1ccx standard specification"
        )
    }
)

factory
[2]:
BasicDatasetFactory(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None), 'ani1ccx': QCSpec(method='ani1ccx', basis=None, program='torchani', spec_name='ani1ccx', spec_description='ANI1ccx standard specification', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None)}, maxiter=200, driver=<DriverEnum.energy: 'energy'>, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], priority='normal', dataset_tags=['openff'], compute_tag='openff', type='BasicDatasetFactory', workflow=[])

that will be responsible for creating a ‘basic’ dataset that will contain a collection of single point calculations provided through the energy/gradient/hessian drivers. Dataset factories are also available for optimization and torsion drive data sets.

Here we have specified that datasets created using this factory should be computed using two different ‘quantum chemical’ (QC) specfications:

  • default: the OpenFF default specification which employs B3LYP-D3BJ+DZVP using psi4.

  • ani1ccx: ANI1ccx provided by the torchani package.

Note that the basic settings should be suitable in most cases as they are those recommended by the OpenFF Consortium and are currently used in the fitting of the most recent force fields.

Now lets look at the workflow components that will be used to curate our initial set of molecules.

[3]:
factory.workflow
[3]:
[]

The workflow is a list that contains the components that will be executed in order, and by default will be empty.

openff-qcsubmit provides a suite of common curation components, such as to filter out molecules that contain unsupported elements, or to generate a set of conformers for each molecule.

Here we will set up a workflow that will filter out some unwanted elements (i.e. those not supported by ANI1), then filter by molecular weight, and finally generate conformers for each of the molecules passing through the factory.

First we set up the element filter:

[4]:
from openff.qcsubmit import workflow_components

el_filter = workflow_components.ElementFilter(
    allowed_elements=[1, 6, 7, 8]
)

factory.add_workflow_components(el_filter)

This filter has the ability to filter elements by atomic name or number, we just have to supply a list of symbols or numbers to the filter. Here we only keep molecules with elements of H,C,N and O as we would like to use AN1 as our QC method.

Now we will set up the weight filter and conformer generation components and add them to the workflow.

[5]:
weight_filter = workflow_components.MolecularWeightFilter()
factory.add_workflow_components(weight_filter)

conf_gen = workflow_components.StandardConformerGenerator(
    max_conformers=1,
    toolkit="rdkit"
)
factory.add_workflow_components(conf_gen)

Let’s look at the workflow and make sure all the components were correctly added:

[6]:
factory.workflow
[6]:
[ElementFilter(type='ElementFilter', allowed_elements=[1, 6, 7, 8]),
 MolecularWeightFilter(type='MolecularWeightFilter', minimum_weight=130, maximum_weight=781),
 StandardConformerGenerator(type='StandardConformerGenerator', toolkit='rdkit', rms_cutoff=None, max_conformers=1, clear_existing=True)]

and save the settings and workflow so they can be used again later:

[7]:
factory.export_settings("example-factory.json")
factory.export_settings("example-factory.yaml")

While workflows can be exported to several formats, the most commonly used are JSON and YAML as shown in this example. Let’s look at the JSON output:

[8]:
! head -n 20 example-factory.json
{
  "qc_specifications": {
    "default": {
      "method": "B3LYP-D3BJ",
      "basis": "DZVP",
      "program": "psi4",
      "spec_name": "default",
      "spec_description": "Standard OpenFF optimization quantum chemistry specification.",
      "store_wavefunction": "none",
      "implicit_solvent": null
    },
    "ani1ccx": {
      "method": "ani1ccx",
      "basis": null,
      "program": "torchani",
      "spec_name": "ani1ccx",
      "spec_description": "ANI1ccx standard specification",
      "store_wavefunction": "none",
      "implicit_solvent": null
    }

These settings can be re-imported easily using the API:

[9]:
imported_factory = BasicDatasetFactory.from_file("example-factory.json")

Creating the dataset

We can run the workflow on an example set of molecules:

[10]:
from openff.toolkit.topology import Molecule

mols = [
    Molecule.from_smiles(smiles)
    for smiles in [
        "[H]/N=C(/N)\\Nc1[nH]nnn1",
        "c1cc[nH+]cc1",
        "C[N+](C)(C)[O-]",
        "CONC(=O)N",
        "c1ccc2c(c1)cc[nH]2",
        "c1ccc(cc1)/N=C\\NO",
        "C=CO",
        "c1cocc1[O-]",
        "CC(=O)NO",
        "C[N+](=C)C",
        "C(=O)C=O",
        "C=C",
        "CC1=NC(=NC1=[N+]=[N-])Cl",
        "c1cc[n+](cc1)[O-]",
        "CN(C)O",
        "N(=O)(=O)O",
        "CC=O",
        "c1cc(oc1)c2ccco2",
        "CC",
        "C1C=CC(=O)C=C1",
    ]
]

This is as simple as calling the factories create_dataset method and providing the set of molecules as input:

[11]:
dataset = factory.create_dataset(
    molecules=mols,
    dataset_name="example-dataset",
    description="An example dataset.",
    tagline="An example dataset."
)
dataset
Deduplication                 : 100%|██████████| 20/20 [00:00<00:00, 686.53it/s]
ElementFilter                 : 100%|██████████| 20/20 [00:00<00:00, 476.25it/s]
MolecularWeightFilter         : 100%|██████████| 19/19 [00:00<00:00, 110.36it/s]
StandardConformerGenerator    : 100%|█████████████| 2/2 [00:00<00:00,  8.46it/s]
Preparation                   : 100%|█████████████| 2/2 [00:00<00:00, 58.50it/s]
[11]:
BasicDataset(qc_specifications={'default': QCSpec(method='B3LYP-D3BJ', basis='DZVP', program='psi4', spec_name='default', spec_description='Standard OpenFF optimization quantum chemistry specification.', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None), 'ani1ccx': QCSpec(method='ani1ccx', basis=None, program='torchani', spec_name='ani1ccx', spec_description='ANI1ccx standard specification', store_wavefunction=<WavefunctionProtocolEnum.none: 'none'>, implicit_solvent=None)}, maxiter=200, driver=<DriverEnum.energy: 'energy'>, scf_properties=[<SCFProperties.Dipole: 'dipole'>, <SCFProperties.Quadrupole: 'quadrupole'>, <SCFProperties.WibergLowdinIndices: 'wiberg_lowdin_indices'>, <SCFProperties.MayerIndices: 'mayer_indices'>], priority='normal', dataset_tags=['openff'], compute_tag='openff', dataset_name='example-dataset', dataset_tagline='An example dataset.', dataset_type='DataSet', description='An example dataset.', metadata=Metadata(submitter='boothros', creation_date=datetime.date(2021, 6, 22), collection_type='DataSet', dataset_name='example-dataset', short_description='An example dataset.', long_description_url=None, long_description='An example dataset.', elements={'C', 'H', 'N', 'O'}), provenance={'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'openff-toolkit': '0.9.1', 'openeye': '2019.Oct.2'}, dataset={'c1ccc(cc1)/N=C\\NO': DatasetEntry(index='c1ccc(cc1)/N=C\\NO', initial_molecules=[Molecule(name='C7H8N2O', formula='C7H8N2O', hash='564b158')], attributes=MoleculeAttributes(canonical_smiles='c1ccc(cc1)N=CNO', canonical_isomeric_smiles='c1ccc(cc1)/N=C\\NO', canonical_explicit_hydrogen_smiles='[H]c1c(c(c(c(c1[H])[H])N=C([H])N([H])O[H])[H])[H]', canonical_isomeric_explicit_hydrogen_smiles='[H]c1c(c(c(c(c1[H])[H])/N=C(/[H])\\N([H])O[H])[H])[H]', canonical_isomeric_explicit_hydrogen_mapped_smiles='[H:11][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:13])[H:15])/[N:8]=[C:7](/[H:16])\\[N:9]([H:17])[O:10][H:18])[H:14])[H:12]', molecular_formula='C7H8N2O', standard_inchi='InChI=1S/C7H8N2O/c10-9-6-8-7-4-2-1-3-5-7/h1-6,10H,(H,8,9)', inchi_key='FEUZPLBUEYBLTN-UHFFFAOYSA-N', fixed_hydrogen_inchi='InChI=1/C7H8N2O/c10-9-6-8-7-4-2-1-3-5-7/h1-6,10H,(H,8,9)/f/h9H/b8-6-', fixed_hydrogen_inchi_key='FEUZPLBUEYBLTN-NAFDMULTNA-N', provenance={'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'openff-toolkit': '0.9.1', 'openeye': '2019.Oct.2'}), extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:11][c:1]1[c:2]([c:4]([c:6]([c:5]([c:3]1[H:13])[H:15])/[N:8]=[C:7](/[H:16])\\[N:9]([H:17])[O:10][H:18])[H:14])[H:12]'}, keywords={}), 'C1=COC(=C1)C2=CC=CO2': DatasetEntry(index='C1=COC(=C1)C2=CC=CO2', initial_molecules=[Molecule(name='C8H6O2', formula='C8H6O2', hash='06d0a08')], attributes=MoleculeAttributes(canonical_smiles='C1=COC(=C1)C2=CC=CO2', canonical_isomeric_smiles='C1=COC(=C1)C2=CC=CO2', canonical_explicit_hydrogen_smiles='[H]C1=C(OC(=C1[H])C2=C(C(=C(O2)[H])[H])[H])[H]', canonical_isomeric_explicit_hydrogen_smiles='[H]C1=C(OC(=C1[H])C2=C(C(=C(O2)[H])[H])[H])[H]', canonical_isomeric_explicit_hydrogen_mapped_smiles='[H:11][C:1]1=[C:5]([O:9][C:7](=[C:3]1[H:13])[C:8]2=[C:4]([C:2](=[C:6]([O:10]2)[H:16])[H:12])[H:14])[H:15]', molecular_formula='C8H6O2', standard_inchi='InChI=1S/C8H6O2/c1-3-7(9-5-1)8-4-2-6-10-8/h1-6H', inchi_key='UDHZFLBMZZVHRA-UHFFFAOYSA-N', fixed_hydrogen_inchi='InChI=1/C8H6O2/c1-3-7(9-5-1)8-4-2-6-10-8/h1-6H', fixed_hydrogen_inchi_key='UDHZFLBMZZVHRA-UHFFFAOYNA-N', provenance={'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'openff-toolkit': '0.9.1', 'openeye': '2019.Oct.2'}), extras={'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[H:11][C:1]1=[C:5]([O:9][C:7](=[C:3]1[H:13])[C:8]2=[C:4]([C:2](=[C:6]([O:10]2)[H:16])[H:12])[H:14])[H:15]'}, keywords={})}, filtered_molecules={'ElementFilter': FilterEntry(component='ElementFilter', component_settings={'type': 'ElementFilter', 'allowed_elements': [1, 6, 7, 8]}, component_provenance={'openff-toolkit': '0.9.1', 'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'openmm_elements': '7.5'}, molecules=['[H]C([H])([H])C1=NC(=NC1=[N+]=[N-])Cl']), 'MolecularWeightFilter': FilterEntry(component='MolecularWeightFilter', component_settings={'type': 'MolecularWeightFilter', 'minimum_weight': 130, 'maximum_weight': 781}, component_provenance={'openff-toolkit': '0.9.1', 'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'openmm_units': '7.5'}, molecules=['[H]/N=C(/N([H])[H])\\N([H])C1=NN=NN1[H]', '[H]c1c(c([n+](c(c1[H])[H])[H])[H])[H]', '[H]C([H])([H])[N+](C([H])([H])[H])(C([H])([H])[H])[O-]', '[H]C([H])([H])ON([H])C(=O)N([H])[H]', '[H]c1c(c(c2c(c1[H])C(=C(N2[H])[H])[H])[H])[H]', '[H]C(=C([H])O[H])[H]', '[H]C1=C(OC(=C1[O-])[H])[H]', '[H]C([H])([H])C(=O)N([H])O[H]', '[H]C(=[N+](C([H])([H])[H])C([H])([H])[H])[H]', '[H]C(=O)C(=O)[H]', '[H]C(=C([H])[H])[H]', '[H]c1c(c([n+](c(c1[H])[H])[O-])[H])[H]', '[H]C([H])([H])N(C([H])([H])[H])O[H]', '[H]ON(=O)=O', '[H]C(=O)C([H])([H])[H]', '[H]C([H])([H])C([H])([H])[H]', '[H]C1=C(C(C(=C(C1=O)[H])[H])([H])[H])[H]']), 'StandardConformerGenerator': FilterEntry(component='StandardConformerGenerator', component_settings={'type': 'StandardConformerGenerator', 'toolkit': 'rdkit', 'rms_cutoff': None, 'max_conformers': 1, 'clear_existing': True}, component_provenance={'openff-toolkit': '0.9.1', 'openff-qcsubmit': '0.2.1+43.g60a8b1e.dirty', 'rdkit': '2020.09.5'}, molecules=[])})

We can easily see how many molecules the dataset contains after filtering:

[12]:
dataset.n_molecules
[12]:
2

and how many QC ‘records’ will be computed for this dataset:

[13]:
dataset.n_records
[13]:
2

We can iterate over the molecules in the dataset:

[14]:
for molecule in dataset.molecules:
    print(molecule.to_smiles(explicit_hydrogens=False))
c1ccc(cc1)/N=C\NO
C1=COC(=C1)C2=CC=CO2

as well as those that were filtered out during its construction:

[15]:
for molecule in dataset.filtered:
    print(molecule.to_smiles(explicit_hydrogens=False))
CC1=NC(=NC1=[N+]=[N-])Cl
[H]/N=C(/N)\NC1=NN=NN1
c1cc[nH+]cc1
C[N+](C)(C)[O-]
CONC(=O)N
c1ccc2c(c1)C=CN2
C=CO
C1=COC=C1[O-]
CC(=O)NO
C[N+](=C)C
C(=O)C=O
C=C
c1cc[n+](cc1)[O-]
CN(C)O
N(=O)(=O)O
CC=O
CC
C1C=CC(=O)C=C1

The fully created dataset is readily exportable to JSON:

[16]:
dataset.export_dataset("example-dataset.json")

and the molecules it contains can be exported to various formats:

[17]:
dataset.molecules_to_file("example-dataset.smi", "smi")
dataset.molecules_to_file("example-dataset.inchi", "inchi")
dataset.molecules_to_file("example-dataset.inchikey", "inchikey")

The molecules contained within a dataset can also be easily visualized by exporting the dataset to a PDF:

[18]:
dataset.visualize("example-dataset.pdf")