Parallelized Calculations and Jobmapping

molli provides a jobmap function that applies external drivers to MoleculeLibrary or ConformerLibrary objects in parallel. molli currently has four drivers (ORCA, CREST, XTB, and NWChem) covering geometry optimization, conformer generation, and property calculation methods. Calculations can be run either on a local computer or on a cluster of computers.

Example 1: Running a Job on a Local Computer

An example script is shown below:

#Necessary imports
import molli as ml
from molli.pipeline.crest import CrestDriver

#This is the file the Molecules are retrieved from
source = ml.MoleculeLibrary("example.mlib", readonly=True)

#This is the file the conformer ensembles calculated will be written to.
destination = ml.ConformerLibrary("example_result.clib", readonly=False)

#This configures the driver and the number of processes each worker will use. The amount of memory to use can also be specified here.
crest = CrestDriver("crest", nprocs=16)

ml.pipeline.jobmap(
    crest.conformer_search,
    source=source, #Source of molecules
    destination=destination, #Where conformers will be written
    cache_dir="./conf_cache", #Where final outputs will be written, successful or not!
    scratch_dir="./scratch_dir", #Scratch Directory where calculations will be run
    n_workers=4, #Number of workers to use. In this case, 4 workers, each with 16 processors as defined in the driver.
    kwargs={
        "method": "gfnff", #GFNFF method to be used
        "temp": 298.15, #Temperature to assume
        "chk_topo": True, #Will check topology
    }, #These are arguments used in the conformer_search function and can be specified directly
    progress=True, #Will print out progress
    verbose=True, #Will print out extra information
)

This will create a ConformerLibrary at the path example_result.clib and write all inputs and outputs to ./conf_cache. The cache directory contains an input folder, which holds the formatted inputs used to submit calculations, and an output folder, which holds the encoded outputs (i.e. written in bytes).
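Because the cache keeps inputs and outputs side by side, it can be used to check which jobs have not yet produced an output file. The following is a minimal sketch, not part of molli: the helper name jobs_without_output is hypothetical, the folder name input is an assumption (output and the .out extension follow Example 3), and it assumes that the input and output files for a job share the same file stem.

```python
from pathlib import Path


def jobs_without_output(cache_dir, input_dirname="input", output_dirname="output"):
    """Return sorted job names that have a cached input but no cached output.

    Assumption: the input and output files of one job share the same stem.
    """
    cache = Path(cache_dir)
    # Stems of every file found in the (assumed) input folder
    inputs = {p.stem for p in (cache / input_dirname).glob("*") if p.is_file()}
    # Stems of every encoded .out file found in the output folder
    outputs = {p.stem for p in (cache / output_dirname).glob("*.out")}
    return sorted(inputs - outputs)
```

Calling jobs_without_output("./conf_cache") after a run would list the entries that may need to be resubmitted, provided the layout assumptions above hold.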

Example 2: Running a Job on a Cluster

For users who want to run on a computational cluster, a separate function, jobmap_sge, submits jobs through the scheduler. This function was designed for clusters configured with the Oracle Grid Engine (also known as Sun Grid Engine) for batch submission of jobs. It has the same functionality as jobmap, with two differences: the collection of all JobInput instances is passed to a process that runs a qsub command instead of a local executor, and n_workers no longer needs to be specified.

#Necessary imports
import molli as ml
from molli.pipeline.crest import CrestDriver

#This is the file the Molecules are retrieved from
source = ml.MoleculeLibrary("example.mlib", readonly=True)

#This is the file the conformer ensembles calculated will be written to.
destination = ml.ConformerLibrary("example_result.clib", readonly=False)

#This configures the driver and the number of processes each worker will use. The amount of memory to use can also be specified here.
crest = CrestDriver("crest", nprocs=16)

ml.pipeline.jobmap_sge(
    crest.conformer_search,
    source,
    destination,
    cache_dir="./conf_cache", #Where final outputs will be written, successful or not!
    scratch_dir="./scratch_dir", #Scratch Directory where calculations will be run
    kwargs={
        "method": "gfnff", #GFNFF method to be used
        "temp": 298.15, #Temperature to assume
        "chk_topo": True, #Will check topology
    }, #These are arguments used in the conformer_search function and can be specified directly
    progress=True, #Will print out progress
    verbose=True, #Will print out extra information
    qsub_header="#$ -pe orte 16\n", #Header added to the batch submission; here it requests 16 slots in the "orte" parallel environment
)

Example 3: Loading Encoded Output Files

If additional information is needed from an output file, or a library was written incorrectly, the encoded outputs in the cache can be loaded and processed directly. An example of this is shown below:

#Necessary imports
import molli as ml
from glob import glob
from pathlib import Path
from tqdm import tqdm

#This is the file the Molecules are retrieved from
source = ml.MoleculeLibrary("example.mlib", readonly=True)

#This is the file the conformer ensembles calculated will be written to.
destination = ml.ConformerLibrary("example_result.clib", readonly=False)

#This reads and writes to the respective files
with source.reading(), destination.writing():
    for file in tqdm(glob('./conf_cache/output/*.out')):
        res = ml.pipeline.JobOutput.load(file) # Loads the Output file from the cache directory
        name = Path(file).stem #Gives name of file
        m = source[name] #Retrieves matching name from the source library

        #This retrieves the conformer geometry
        all_geoms = ml.CartesianGeometry.loads_all_xyz(
            res.files["crest_conformers.xyz"].decode()
        )

        # This creates a conformer ensemble
        result = ml.ConformerEnsemble(m, n_conformers=len(all_geoms))

        # This updates the coordinates of all the conformers
        for blank_conf, conf_geom in zip(result, all_geoms):
            blank_conf.coords = conf_geom.coords

        destination[name] = result

External Drivers and Available Methods

  • ORCA

    • All functions use a template to create and format ORCA input files. A parser uses regular expressions to extract properties from the m_orca_property.txt file; this has been tested on ORCA 5.0 and higher. The parsed properties are: updated coordinates, SCF Energy or VDW Corrections, Mayer population analysis, MDCI_Energies, solvation details, dipole moments, DFT energy, calculated NMR shifts, calculated Hessians, and thermochemistry values.

    • Methods Available:

      • basic_calc_m - allows implementation of various routine calculations, including single point energy calculations, geometry optimizations, vibrational frequency calculations, etc.

      • optimize_ens - same as basic_calc_m but operates on a ConformerEnsemble instead of a Molecule

      • giao_nmr_m - allows calculation of NMR shifts for specified elements

      • giao_nmr_ens - same as giao_nmr_m but operates on a ConformerEnsemble instead of a Molecule

      • scan_dihedral - calculates a potential energy surface scan for a 360° rotation around four atoms of interest. Returns a ConformerEnsemble of each step

  • CREST

    • These functions support forwarding of CREST command-line parameters, such as the XTB method, temperature, energy window, the length of the metadynamics simulation, the dump frequency at which coordinates are written to the trajectory file, the dump frequency at which coordinates are given to the variable reference structure list, and checking for changes in topology. Additional miscellaneous command-line options can also be specified.

    • Methods Available:

      • conformer_search - runs a general conformational search on a Molecule and creates a ConformerEnsemble

      • conformer_screen - runs the screen ensemble optimization protocol in CREST for a ConformerEnsemble. This optimizes points along a trajectory and then sorts the conformer ensemble based on energy, rotational constants, and Cartesian RMSDs with a specified XTB method

  • XTB

    • These functions support forwarding of the XTB command line arguments. A custom input file can be specified, although the user is responsible for setting up its contents.

    • Methods Available:

      • optimize_m - performs an XTB optimization of a Molecule and returns a new Molecule instance with the optimized coordinates

      • optimize_ens - same as optimize_m but operates on a ConformerEnsemble instead of a Molecule

      • energy_m - performs an XTB energy calculation and adds this as an attribute to the Molecule

      • scan_dihedral - calculates a potential energy surface scan for a 360° rotation around four atoms of interest. Returns a ConformerEnsemble of each step

      • atom_properties_m - runs an XTB energy calculation and includes a parser that identifies calculated properties from the output file. The properties parsed for each atom are dispersion, polarizability, charge, covalent coordination number, three Fukui indices, and Wiberg bond index. These are stored as properties of each Atom in the Molecule returned.

  • NWChem

    • This driver uses an NWChem template to help create and format NWChem input files.

    • Methods Available:

      • optimize_atomic_esp_charges_m - calculates charges for each atom after calculation of an electrostatic potential. If specified, a DFT optimization can be run before the electrostatic potential calculation. In addition, the electrostatic minimum and maximum of the collective grid points can be stored as an attribute of the Molecule.

Structure of Jobs

The molli jobmap function takes an input library, a Job to be performed on the members of the library, the output library to be written to, a cache directory for intermediate outputs, and a scratch directory in which each Job is run. The resulting Molecule or ConformerEnsemble objects are serialized into the output library. Methods that operate with the Job class require the specification of two methods: prep and post.

  • prep - operates on a Molecule or ConformerEnsemble from the library and creates a JobInput. This will have various attributes including a Job ID, a list of commands to be run, specifications of output streams, and files to be cached.

  • post - takes the method in the driver and redefines the method for the final step of the Job to process the output file. This will take an encoded output file, attempt to create a JobOutput, and, if the Job was performed as expected, execute the new method to create an object with the updated attributes.
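The division of labor between prep and post can be illustrated with a small self-contained sketch. This is not molli's implementation: ToyJobInput, ToyJobOutput, the command string, and the file names are all hypothetical stand-ins chosen only to show the shape of the two steps.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobInput:
    """Hypothetical stand-in for a job input: an ID, commands, and files to cache."""
    jid: str
    commands: list
    files: dict = field(default_factory=dict)


@dataclass
class ToyJobOutput:
    """Hypothetical stand-in for a job output: exit code plus cached files as bytes."""
    exitcode: int
    files: dict = field(default_factory=dict)


def prep(name, xyz_text):
    """Build the job description from an object taken from the source library."""
    return ToyJobInput(
        jid=name,
        commands=["toy_program input.xyz"],  # hypothetical command line
        files={"input.xyz": xyz_text.encode()},
    )


def post(output, name):
    """Turn a finished job into a result, or None if the job did not succeed."""
    if output.exitcode != 0 or "result.txt" not in output.files:
        return None
    energy = float(output.files["result.txt"].decode())
    return {"name": name, "energy": energy}
```

In this toy version, prep only describes the work and post only interprets the cached files, so the step in between can be carried out by a local executor or a cluster scheduler without changing either function.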

jobmap takes all instances of JobInput and passes them to a ThreadPoolExecutor, which distributes them across the number of workers requested. For example, if the driver requests 4 cores and 4 workers are used, the submissions are partitioned such that 16 CPU cores are in use. While CPython typically does not gain performance from thread-based parallelism on CPU-bound tasks, this task can effectively be considered I/O-bound, since the computing is done by an external process. Upon completion of a job, the outputs are collected and encoded in a single .out file.
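The reason threads are sufficient here can be seen in a minimal standard-library sketch. It is not molli's code: run_external and run_all are hypothetical names, and a child Python process stands in for an external program such as CREST. Each thread spends its time waiting on a subprocess, so the threads do not compete for the interpreter.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_external(job_id):
    """Run one external process and return its ID with the captured stdout."""
    # The child Python process stands in for an external program.
    proc = subprocess.run(
        [sys.executable, "-c", f"print('finished {job_id}')"],
        capture_output=True,
        text=True,
        check=True,
    )
    return job_id, proc.stdout.strip()


def run_all(job_ids, n_workers=4):
    """Dispatch jobs to a thread pool; each thread only waits on its subprocess."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(run_external, job_ids))
```

With n_workers=4 and an external program configured to use 4 cores per run, up to 16 cores would be busy at once, which matches the arithmetic in the paragraph above.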

jobmap_sge functions the same as jobmap; the main difference is that it was created for submission of jobs through a scheduler. Different cluster configurations may require additional specification in the submission, so an option for a header in the batch submission is available. This function monitors Job IDs as they complete on the cluster and still captures the respective outputs requested from the Job.
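To make the role of the header concrete, the sketch below shows one way a batch script could be assembled from a header string and a list of commands. It is an illustration under stated assumptions, not how jobmap_sge builds its submission: build_submission_script is a hypothetical helper, and the -N and -cwd directives are ordinary Grid Engine options added here only for the example.

```python
def build_submission_script(job_name, commands, qsub_header=""):
    """Assemble a shell script for qsub: shebang, directives, then the commands."""
    lines = ["#!/bin/bash", f"#$ -N {job_name}", "#$ -cwd"]
    # Cluster-specific directives are passed through unchanged.
    lines.extend(line for line in qsub_header.splitlines() if line.strip())
    # The commands of the job follow the directives.
    lines.extend(commands)
    return "\n".join(lines) + "\n"
```

For instance, passing qsub_header="#$ -pe orte 16\n" places that directive between the fixed directives and the commands, so the parallel environment request travels with every submitted job.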

In theory, any function that maintains the syntax seen in the existing drivers can be submitted to molli. This was designed to allow a variety of drivers and interfaces to be implemented.