Inputs

Powered by the fantastic openff-toolkit, QCSubmit can consume input molecules from a wide range of sources, including standard file formats, molecule objects and HDF5 files.

You can read more about each of these inputs below. In general, getting started only requires passing the input to your chosen QCSubmit dataset generation factory via the molecules keyword argument of the create_dataset() function, as shown here:

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the single/multiple molecule sdf here
    molecules="my_exotic_sdf.sdf",
    ...
)

QCSubmit will then determine the type of input and process it accordingly using the ComponentResult class, which deduplicates the molecules while preserving unique conformations.
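The idea of deduplicating molecules while keeping their distinct conformations can be illustrated with a minimal, standard-library sketch. It is not the ComponentResult implementation: the deduplicate function, the use of a plain string as the molecule identifier and the exact-equality comparison of coordinates are all assumptions made for illustration.

```python
def deduplicate(records):
    """Merge records sharing an identifier, keeping each distinct conformer once.

    `records` is an iterable of (identifier, conformer) pairs. Both are
    hypothetical stand-ins: the identifier is a plain string and a conformer
    is a tuple of (x, y, z) tuples compared by exact equality.
    """
    unique = {}
    for identifier, conformer in records:
        conformers = unique.setdefault(identifier, [])
        if conformer not in conformers:
            conformers.append(conformer)
    return unique


records = [
    ("CCO", ((0.0, 0.0, 0.0),)),
    ("CCO", ((0.1, 0.0, 0.0),)),  # same molecule, new conformer: kept
    ("CCO", ((0.0, 0.0, 0.0),)),  # exact repeat: dropped
    ("CO", ((0.0, 0.0, 0.0),)),
]
result = deduplicate(records)
print({name: len(confs) for name, confs in result.items()})
# {'CCO': 2, 'CO': 1}
```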

Standard file formats

QCSubmit supports individual files in standard molecular file formats, as well as directories containing a mix of formats. To read a directory, simply provide its path and QCSubmit will search through it, trying to read in molecules from each file.
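The directory behaviour described above can be pictured roughly as follows. This is only a sketch under stated assumptions: read_all and the reader callable are hypothetical names, the search is assumed not to descend into subdirectories, and how QCSubmit actually treats files it cannot parse is not documented here.

```python
from pathlib import Path


def read_all(path, reader):
    """Return (molecules, skipped_file_names) for a file or a directory.

    `reader` is a hypothetical callable that takes a file path and returns
    a list of molecules, raising an exception if the file cannot be parsed.
    """
    path = Path(path)
    if path.is_dir():
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]

    molecules, skipped = [], []
    for file in files:
        try:
            molecules.extend(reader(file))
        except Exception:
            # unreadable or unsupported file: note it and carry on
            skipped.append(file.name)
    return molecules, skipped
```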

Molecule objects

In some cases you may want to pre-process the molecules using a custom workflow not yet supported by QCSubmit, leaving you with a collection of molecule objects from RDKit, OpenEye or the openff-toolkit. As QCSubmit uses the openff-toolkit Molecule class internally when processing datasets, the objects must first be converted to this type. To ensure the conversion is correct, the Molecule class provides convenience methods for converting from RDKit and OpenEye objects.

from openff.toolkit.topology import Molecule

# a list of OE and RDKit molecules
processed_mols = [oemol1, oemol2, rdmol1, rdmol2]

# convert to openff.toolkit.topology.Molecule instances
molecules = [Molecule(ref_mol) for ref_mol in processed_mols]

dataset = factory.create_dataset(
    dataset_name="My exotic dataset",
    # pass the list of molecules
    molecules=molecules,
    ...
)

HDF5 files

Warning

HDF5 support is still pre-alpha and so the specification is still evolving.

QCSubmit also supports HDF5 files following a simple format that is well suited to inputs containing many conformations per molecule. The format consists of one group per molecule, stored under the index that should be assigned to the molecule. Two datasets should then be created under this group, with the following names and contents:

  • conformations: A numpy ndarray containing all of the molecule conformations with shape (n, n_atoms, 3), where n is the number of conformations and n_atoms is the number of atoms in the molecule.

  • smiles: A length-1 list containing the mapped SMILES string that represents the topology of the entire system.

Note

If the system contains multiple components, use a single SMILES string with atoms indexed from 1 to m, where m is the total number of atoms, and distinguish the individual components using the . separator.
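A quick consistency check on an entry before writing it can catch indexing mistakes early. The sketch below is not part of QCSubmit: mapped_indices and check_entry are hypothetical helpers, the regular expression assumes every atom is written in brackets with a map index (as in an explicit-hydrogen mapped SMILES), and requiring n_atoms to equal m is an assumption drawn from the description above.

```python
import re


def mapped_indices(smiles):
    """Return the atom map indices found in a mapped SMILES, in string order."""
    return [int(i) for i in re.findall(r":(\d+)\]", smiles)]


def check_entry(smiles, conformations):
    """Check map indices run 1..m and every conformation has shape (m, 3)."""
    indices = mapped_indices(smiles)
    m = len(indices)
    if m == 0 or sorted(indices) != list(range(1, m + 1)):
        return False
    return all(
        len(conf) == m and all(len(xyz) == 3 for xyz in conf)
        for conf in conformations
    )


# water and hydrogen chloride as one system, indexed 1..5 across both components
smiles = "[O:1]([H:2])[H:3].[Cl:4][H:5]"
one_conformation = [[0.0, 0.0, 0.0]] * 5
print(check_entry(smiles, [one_conformation]))  # True

# indexing restarted on the second component, so the map is not 1..m
print(check_entry("[O:1]([H:2])[H:3].[Cl:1][H:2]", [one_conformation]))  # False
```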

Finally, the units of the molecule conformations should be set as an attribute of the conformations dataset under the key units. The recognised units are as follows:

  • nanometer(s)

  • angstrom(s)

  • bohr(s)
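To make the relationship between these units concrete, here is a small sketch that accepts the singular or plural spelling and converts a length to angstrom. The to_angstrom helper and its exact-spelling matching are assumptions for illustration, not a description of how QCSubmit parses the attribute; the conversion factors themselves are standard (1 nm = 10 Å, 1 bohr ≈ 0.529177 Å).

```python
# Conversion factors from each recognised unit to angstrom.
TO_ANGSTROM = {
    "nanometer": 10.0,
    "angstrom": 1.0,
    "bohr": 0.529177210903,  # CODATA 2018 Bohr radius in angstrom
}


def to_angstrom(value, units):
    """Convert `value` in `units` to angstrom, accepting singular or plural names."""
    key = units[:-1] if units.endswith("s") else units
    if key not in TO_ANGSTROM:
        raise ValueError(f"unrecognised units: {units!r}")
    return value * TO_ANGSTROM[key]


print(to_angstrom(1.0, "nanometers"))  # 10.0
print(to_angstrom(2.5, "angstrom"))    # 2.5
```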

Demonstration

HDF5 files following this format can then be readily made using the openff-toolkit:

import h5py
import numpy as np
from simtk import unit

output_file = h5py.File("my_exotic_molecules.hdf5", "w")

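# target_molecules: a list of openff.toolkit.topology.Molecule instances with conformations
for molecule in target_molecules:
    # the mapped SMILES with explicit hydrogens records the atom ordering
    smiles = molecule.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)
    # strip the units, keeping the coordinates in nanometers
    conformations = [c.value_in_unit(unit.nanometers) for c in molecule.conformers]
    # one group per molecule, stored under its index (here the molecule name)
    group = output_file.create_group(molecule.name)
    group.create_dataset('smiles', data=[smiles], dtype=h5py.string_dtype())
    ds = group.create_dataset('conformations', data=np.array(conformations), dtype=np.float32)
    # record the units of the stored coordinates
    ds.attrs['units'] = 'nanometers'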

output_file.close()
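Reading the file back is a simple way to confirm that it follows the layout described above. The sketch below assumes h5py 3 or later (for Dataset.asstr()); read_entries is a hypothetical helper and does not reflect how QCSubmit itself loads these files.

```python
import h5py


def read_entries(path):
    """Yield (index, smiles, conformations, units) for each molecule group."""
    with h5py.File(path, "r") as handle:
        for index, group in handle.items():
            # variable-length strings are stored as bytes; asstr() decodes them
            smiles = group["smiles"].asstr()[0]
            dataset = group["conformations"]
            # dataset[()] loads the full (n, n_atoms, 3) array into memory
            yield index, smiles, dataset[()], dataset.attrs["units"]
```

For example, iterating over read_entries("my_exotic_molecules.hdf5") and printing conformations.shape for each entry should show one (n, n_atoms, 3) array per molecule.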