{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parallelized Calculations and Jobmapping\n",
    "\n",
    "`molli` has implemented a `jobmap` function that enables the parallelized application of external drivers to `MoleculeLibrary` or `ConformerLibrary` objects. `molli` currently has 4 unique drivers for various geometry optimization, conformer generation, and property calculation methods. The `molli` `jobmap` function can be used to run parallelized calculations. These can be run through either a local computer or a cluster of computers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1: Running a Job on a Local Computer\n",
    "\n",
    "An example script is shown below\n",
    "\n",
    "```python\n",
    "\n",
    "##Necessary imports\n",
    "import molli as ml\n",
    "from molli.pipeline.crest import CrestDriver\n",
    "\n",
    "#This is the file the Molecules are retrieved from\n",
    "source = ml.MoleculeLibrary(\"example.mlib\", readonly=True)\n",
    "\n",
    "#This is the file the conformer ensembles calculated will be written to.\n",
    "destination = ml.ConformerLibrary(\"example_result.clib\", readonly=False)\n",
    "\n",
    "#This configures the driver, number of processes to use for each worker. Can also indicate how much memory to use.\n",
    "crest = CrestDriver(\"crest\", nprocs=16)\n",
    "\n",
    "ml.pipeline.jobmap(\n",
    "    crest.conformer_search,\n",
    "    source=source, #Source of molecules\n",
    "    destination=destination, #Where conformers will be written\n",
    "    cache_dir=\"./conf_cache\", #Where final outputs will be written, successful or not!\n",
    "    scratch_dir=\"./scratch_dir\", #Scratch Directory where calculations will be run\n",
    "    n_workers=4, #Number of workers to use. In this case, 4 workers, each with 16 processors as defined in the driver.\n",
    "    kwargs={\n",
    "        \"method\": \"gfnff\", #GFNFF method to be used\n",
    "        \"temp\": 298.15, #Temperature to assume\n",
    "        \"chk_topo\": True, #Will check topology\n",
    "    }, #These are arguments used in the conformer_search function and can be specified directly\n",
    "    progress=True, #Will print out progress\n",
    "    verbose = True, #Will print out extra information\n",
    ")\n",
    "```\n",
    "\n",
    "This will create a Conformer Library with the path `example.clib`, and all inputs/outputs to `./conf_cache`. In the cache directory, there is an `input` folder which contains the formatted inputs used to submit calculations, as well as the `output` folder, which contains an encoded output (i.e. written in bytes)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 2: Running a Job on a Cluster\n",
    "\n",
    "In the likely event the user wants to use a computational cluster, a separate function was created for submission of jobs through the scheduler called `jobmap_sge`. This function was designed for use with clusters configured with the Oracle Grid Engine (also known as Sun Grid Engine) for batch submissions of jobs. This has the same functionality as `jobmap`, with the only deviation being that the collection of all `JobInput` instances is passed to a process that runs a `qsub` command instead of a local executor, and that `n_workers` no longer needs to be specified.\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "#Necessary imports\n",
    "import molli as ml\n",
    "from molli.pipeline.crest import CrestDriver\n",
    "\n",
    "#This is the file the Molecules are retrieved from\n",
    "source = ml.MoleculeLibrary(\"example.mlib\", readonly=True)\n",
    "\n",
    "#This is the file the conformer ensembles calculated will be written to.\n",
    "destination = ml.ConformerLibrary(\"example_result.clib\", readonly=False)\n",
    "\n",
    "#This configures the driver, number of processes to use for each worker. Can also indicate how much memory to use.\n",
    "crest = CrestDriver(\"crest\", nprocs=16)\n",
    "\n",
    "ml.pipeline.jobmap_sge(\n",
    "    crest.conformer_search,\n",
    "    source,\n",
    "    destination,\n",
    "    cache_dir=\"./conf_cache\", #Where final outputs will be written, successful or not!\n",
    "    scratch_dir=\"./scratch_dir\", #Scratch Directory where calculations will be run\n",
    "    kwargs={\n",
    "        \"method\": \"gfnff\", #GFNFF method to be used\n",
    "        \"temp\": 298.15, #Temperature to assume\n",
    "        \"chk_topo\": True, #Will check topology\n",
    "    }, #These are arguments used in the conformer_search function and can be specified directly\n",
    "    progress=True, #Will print out progress\n",
    "    verbose = True, #Will print out extra information\n",
    "    qsub_header=\"#$ -pe orte 16\\n\", #This will \n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 3: Loading Encoded Output Files\n",
    "\n",
    "In the event that there is additional information desired from a file or a library gets written incorrectly, the encoded output cache can be read from and certain methods can be used. An example of this is shown below:\n",
    "\n",
    "\n",
    "```python\n",
    "#Necessary imports\n",
    "import molli as ml\n",
    "from glob import glob\n",
    "from pathlib import Path\n",
    "from tqdm import tqdm\n",
    "\n",
    "#This is the file the Molecules are retrieved from\n",
    "source = ml.MoleculeLibrary(\"example.mlib\", readonly=True)\n",
    "\n",
    "#This is the file the conformer ensembles calculated will be written to.\n",
    "destination = ml.ConformerLibrary(\"example_result.clib\", readonly=False)\n",
    "\n",
    "#This reads and writes to the respective files\n",
    "with source.reading(), destination.writing():\n",
    "    for file in tqdm(glob('./conf_cache/output/*.out')):\n",
    "        res = ml.pipeline.JobOutput.load(file) # Loads the Output file from the cache directory\n",
    "        name = Path(file).stem #Gives name of file\n",
    "        m = source[name] #Retrieves matching name from the source library\n",
    "\n",
    "        #This retrieves the conformer geometry\n",
    "        all_geoms = ml.CartesianGeometry.loads_all_xyz(\n",
    "            res.files[\"crest_conformers.xyz\"].decode()\n",
    "        )\n",
    "        \n",
    "        # This creates a conformer ensemble\n",
    "        result = ml.ConformerEnsemble(m, n_conformers=len(all_geoms))\n",
    "\n",
    "        # This updates the coordinates of all the conformers\n",
    "        for blank_conf, conf_geom in zip(result, all_geoms):\n",
    "            blank_conf.coords = conf_geom.coords\n",
    "\n",
    "        destination[name] = result\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## External Drivers and Available Methods\n",
    "\n",
    "- ORCA\n",
    "    - All functions utilize a template to create and format ORCA input files. A parser identifies various properties from the m_orca_property.txt file using regular expressions. This has been tested on ORCA 5.0 and higher. The parser identifies various properties from this file: updated coordinates, SCF Energy or VDW Corrections, Mayer population analysis, MDCI_Energies, solvation details, dipole moments, DFT energy, calculated NMR shifts, calculated Hessians, and thermochemistry values\n",
    "\n",
    "    - Methods Available:\n",
    "        - `basic_calc_m` - allows implementation of various routine calculations, including single point energy calculations, geometry optimizations, vibrational frequency calculations, etc.\n",
    "        - `optimize_ens` - same as `basic_calc_m` but operates on `ConformerEnsemble` insead of a `Molecule`\n",
    "        - `giao_nmr_m` - allows calculation of NMR shifts for specified elements\n",
    "        - `giao_nmr_ens` - same as `giao_nmr_m` but operates on a `ConformerEnsemble` insead of a `Molecule`\n",
    "        - `scan_dihedral` - calculates a potential energy surface scan for a 360$\\degree$ rotation around four atoms of interest. Returns a `ConformerEnsemble` of each step\n",
    "\n",
    "- CREST\n",
    "    - These functions support forwarding of CREST command-line parameters, such as the XTB method, temperature, energy window, the length of the metadynamics simulation, the dump frequency at which coordinates are written to the trajectory file, the dump frequency in which coordinates are given to the variable reference structure list, and checking for changes in topology. These functions allow further miscellaneous command specification.\n",
    "\n",
    "    - Methods Available:\n",
    "        - `conformer_search` - runs a general conformational search on a `Molecule` and creates a `ConformerEnsemble`\n",
    "        - `conformer_screen` - runs the screen ensemble optimization protocol in CREST for a ConformerEnsemble. This optimizes points along a trajectory and then sorts the conformer ensemble based on energy, rotational constants, and Cartesian RMSDs with a specified XTB method\n",
    "- XTB\n",
    "    - These functions support forwarding of the XTB command line arguments. A custom input file can be specified, although the user is responsible for setting up its contents.\n",
    "\n",
    "    - Methods Available:\n",
    "        - `optimize_m` - performs an XTB optimization of a Molecule and returns a new Molecule instance with the optimized coordinates\n",
    "        - `optimize_ens` - same as `optimize_m` but operates on a `ConformerEnsemble` instead of a `Molecule`\n",
    "        - `energy_m` - performs an XTB energy calculation and adds this as an attribute to the `Molecule`\n",
    "        - `scan_dihedral` - calculates a potential energy surface scan for a 360$\\degree$ rotation around four atoms of interest. Returns a `ConformerEnsemble` of each step\n",
    "        - `atom_properties_m` - runs an XTB energy calculation and includes a parser that identifies calculated properties from the output file. The properties parsed for each atom are dispersion, polarizability, charge, covalent coordination number, three Fukui indices, and Wiberg bond index. These are stored as properties of each Atom in the Molecule returned.\n",
    "\n",
    "\n",
    "- NWChem\n",
    "    - This method utilizes an NWChem template to help create and format NWChem input files.\n",
    "    \n",
    "    -Methods Available:\n",
    "        - `optimize_atomic_esp_charges_m` - calculates charges for each atom after calculation of an electrostatic potential. If specified, a DFT optimization can be run before electrostatic potential calculation. In addition, the electrostatic minimum and maximum of the collective grid points can be stored as an attribute of the `Molecule`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Structure of Jobs\n",
    "\n",
    "The `molli` `jobmap` function takes an input library, a `Job` to be performed on the members of the library, the output library to be written to, a cache directory to write intermediate outputs, and a scratch directory to run each `Job`. The `Molecule` or `ConformerEnsemble` objects will be serialized in the new objects. Methods that are operate with the `Job` class require the specification of two methods: `prep` and `post`.\n",
    "\n",
    "- `prep` - operates on a `Molecule` or `ConformerEnsemble` from the library and creates a `JobInput`. This will have various attributes including a Job ID, a list of commands to be run, specifications of output streams, and files to be cached. \n",
    "- `post`- takes the method in the driver, redefines the methhod for the final step of the `Job` to process the output file. This will take an encoded output file, attempt to create a `JobOutput`, and execute the new method to create an object with the updated attributes if the `Job` was performed as expected. \n",
    "\n",
    "`jobmap` takes all instances of `JobInput`, passes them to a ThreadPoolExecutor, and then splits them based on the number of workers requested. For example, if 4 cores are requested with 4 workers, this will partition the submissions such that 16 cores of the CPU will be used. While CPython typically does not gain performance with thread-based parallelism on CPU-bound tasks, this task can be effectively considered an I/O-bound task since the computing is done by an external process.  Upon completion of a job, the outputs will be collected and encoded in a single .out file.\n",
    "\n",
    "`jobmap_sge` functions the same as `jobmap`, with the main difference is that it was created for submission of jobs through a scheduler. Different configurations of clusters may require additional specification in the submission, so an additional option for a header in the batch submission is available. This function will monitor Job IDs as they complete on the cluster and still capture respective outputs requested from the Job.\n",
    "\n",
    "In theory, any function can be submitted to molli maintaining the syntax seen in varying drivers. This was designed to allow varying drivers and interfaces to be implemented"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dev-blake",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}