{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `molli` Study case: Cladosporin GIAO-DFT NMR prediction\n", "\n", "## Objective\n", "The objective of this task is to use the `molli` workflow to transition from `.CDXML` molecular drawings to full-fledged 3D structures with subsequent CREST//GFN2-XTB conformer generation, ORCA equilibrium structure minimizations and GIAO-DFT NMR predictions.\n", "This workflow is particularly useful in validation of relative stereochemistry where experimental NMR is not conclusive alone (see e. g. [*J. Org. Chem. 2020*, **85**, 5, 3297](doi.org/10.1021/acs.joc.9b03129))\n", "\n", "## Prerequisites\n", "- `molli` installation\n", "- `pandas` installation (it is NOT a dependency of molli and must be installed separately)\n", "- `OpenBabel` installation (it is NOT a dependency of molli and must be installed separately)\n", " Note: if you experience problems with openbabel refusing to optimize the structures, do not forget to set the `BABEL_DATADIR` environment variable to point to a valid location with openbabel data files. With conda installations, it seems to be consistently pointing to `$CONDA_PREFIX/share/openbabel`\n", "- `openpyxl` installation (it is NOT a dependency of molli and must be installed separately)\n", "- `rmsd` installation (it is NOT a dependency of molli and must be installed separately)\n", "- `cladosporin.csv` file that has the experimental NMR chemical shifts\n", "- `cladosporin.cdxml` file that contains the structural drawings\n", "\n", "## Methods\n", "\n", "### Conformer generation\n", "Conformers are generated with iMTD-GC workflow as implemented in CREST. \n", "\n", "### " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import molli as ml\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# This configures on-screen preview of molecules\n", "ml.visual.configure(bgcolor=\"black\", theme=\"dark\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1. `CDXML` Parsing\n", "\n", "In this step, all of the molecules are read from the CDXML file and converted into their 3D versions.\n", "This can be done from both the command line interface, as well as programmatically. Here, the command line interface will be shown." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parsing cladosporin.cdxml: 100%|█████████████████| 4/4 [00:00<00:00, 148.00it/s]\n" ] } ], "source": [ "!molli parse cladosporin.cdxml -o cladosporin_g0.mlib --hadd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now use IPython line magic `mlib_view` to visualize the entries in the resulting library." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
\n

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n jupyter labextension install jupyterlab_3dmol

\n
\n", "text/html": [ "
\n", "

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n", " jupyter labextension install jupyterlab_3dmol

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%mlib_view cladosporin_g0.mlib cladosporin_1a " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. OpenBabel preliminary optimization\n", "In this step we will perform a preliminary optimization of these molecules with OpenBabel's implementation of MMFF94 force field. This typically removes parsing artifacts resulting in unreasonable geometries which *may* break GFN2-XTB optimizations from time to time.\n", "\n", "Important: do inspect the resulting structures as they do not always converge to the necessary results." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
\n

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n jupyter labextension install jupyterlab_3dmol

\n
\n", "text/html": [ "
\n", "

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n", " jupyter labextension install jupyterlab_3dmol

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Molecule(name='cladosporin_1d', formula='C16 H20 O5')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from molli.external.openbabel import obabel_optimize\n", "\n", "source_lib = ml.MoleculeLibrary(\"cladosporin_g0.mlib\")\n", "obabel_preopt = ml.MoleculeLibrary(\"cladosporin_g1.mlib\", overwrite=True, readonly=False)\n", "\n", "\n", "# NOTE: in certain unreasonable starting geometries, such as linear C-O-H geometries, planar tetrahedral atoms and many more \n", "# force field gradients corresponding to certain internal degrees of freedom seem to diverge\n", "# this results in the inability of the corresponding forcefield to perform the optimization\n", "# the only solution we were able to come up with to ameliorate this situation is to displace EVERY atom by a random vector of a certain magnitude\n", "\n", "\n", "# These guards may look a bit clumsy\n", "# but they become enable synchronization of parallel processes working on the same libraries\n", "with source_lib.reading(), obabel_preopt.writing():\n", " for k in source_lib.keys():\n", " optimized = obabel_optimize(\n", " mol=source_lib[k],\n", " ff=\"mmff94\",\n", " max_steps=1000,\n", " tol=1e-4,\n", " coord_displace=0.1, # see note above\n", " inplace=False\n", " )\n", "\n", " obabel_preopt[k] = optimized\n", "\n", "optimized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. GFN2-XTB Conformer generation\n", "\n", "In this step the conformers of molecules in question will be generated. For this we will be utilizing `molli.pipeline` module. While it is still in development, it allows to simplify parallel calculations with intermediate result caching.\n", "\n", "**Note** The code for this is in a separate file because jupyter notebook does not play nice with the thread pool executor." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running conformer generation with: crest 32\n", "debug: nprocs 32\n", "====================================================================================================\n", "Submitting jobs: 0it [00:00, ?it/s]\n", "Waiting for jobs: 0it [00:00, ?it/s]\n", "Finalizing the calculations: 0it [00:00, ?it/s]\n", "====================================================================================================\n", "Submitting jobs: 0it [00:00, ?it/s]\n", "Waiting for jobs: 0it [00:00, ?it/s]\n", "Finalizing the calculations: 0it [00:00, ?it/s]\n", "====================================================================================================\n", "Starting a molli.pipeline.jobmap_sge calculation:\n", "calculation: ORCADriver.optimize_ens\n", "None\n", " source: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g3.clib'), n_items=4)\n", "destination: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4.clib'), n_items=0)\n", "scratch dir: /home/blakeo2/.molli/scratch\n", "Total number of jobs: 4\n", "Exist in destination: 0\n", "To be computed: 4\n", "Submitting jobs: 0it [00:00, ?it/s]\n", "Waiting for jobs: 0it [00:00, ?it/s]\n", "Finalizing the calculations: 100%|████████████████| 4/4 [00:03<00:00, 1.04it/s]\n", "====================================================================================================\n", "Starting a molli.pipeline.jobmap_sge calculation:\n", "calculation: ORCADriver.giao_nmr_ens\n", "None\n", " source: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4.clib'), n_items=4)\n", "destination: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4_nmr.clib'), n_items=0)\n", "scratch dir: /home/blakeo2/.molli/scratch\n", "Total number of jobs: 4\n", "Exist in destination: 0\n", "To be computed: 4\n", "Submitting jobs: 0it [00:00, ?it/s]\n", "Waiting for jobs: 0it [00:00, ?it/s]\n", "Finalizing the calculations: 100%|████████████████| 4/4 [00:00<00:00, 7.07it/s]\n" ] } ], "source": [ "!python workflow.py" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The output path is cladosporin_g3a.clib\n", "matching structures:\n", "100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 82.68it/s]\n", "bringing to an optimal rotation:\n", "100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 22.36it/s]\n", "Conformers checked: 340, maximal rmsd: 0.165, minimal rmsd: 0.042\n", "Median rmsd: 0.095\n" ] } ], "source": [ "# This is not a necessary step but it does make our lives easier\n", "!molli align -i cladosporin_g3.clib -o cladosporin_g3a.clib -q align_qry.mol2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The output path is cladosporin_g4a.clib\n", "matching structures:\n", "100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 76.89it/s]\n", "bringing to an optimal rotation:\n", "100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 21.98it/s]\n", "Conformers checked: 335, maximal rmsd: 0.088, minimal rmsd: 0.046\n", "Median rmsd: 0.07\n" ] }, { "data": { "application/3dmoljs_load.v0": "
\n

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n jupyter labextension install jupyterlab_3dmol

\n
\n", "text/html": [ "
\n", "

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n", " jupyter labextension install jupyterlab_3dmol

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "!molli align -i cladosporin_g4.clib -o cladosporin_g4a.clib -q align_qry.mol2\n", "%clib_view cladosporin_g4a.clib cladosporin_1c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing NMR\n", "\n", "NMR chemical shifts will be computed with linear correction according to $ \\delta_i = k \\sigma_i + b $, where $k = -0.9127, b = 170.87$ ." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computing NMR for conformer ensemble 'cladosporin_1a'\n", " nuclei expt_nmr pred_nmr error\n", "0 1 169.9 168.9 -1.0\n", "1 3 76.3 75.6 -0.7\n", "2 4 33.6 35.0 1.4\n", "3 4a 141.8 143.9 2.1\n", "4 5 106.7 103.6 -3.1\n", "5 6 163.1 161.8 -1.3\n", "6 7 102.0 98.9 -3.1\n", "7 8 164.3 164.3 0.0\n", "8 8a 101.5 100.0 -1.5\n", "9 9 39.3 39.1 -0.2\n", "10 10 66.6 65.9 -0.7\n", "11 11 30.9 31.6 0.7\n", "12 12 18.1 20.1 2.0\n", "13 13 30.9 32.1 1.2\n", "14 14 68.0 67.7 -0.3\n", "15 15 18.9 19.1 0.2\n", "--------------------\n", "ConformerEnsemble(name='cladosporin_1a', formula='C16 H20 O5'), ensemble.n_conformers=85\n", "Highest Boltz weight: 24.329%\n", "Conformers (> 1% wt): 16\n", "MAE= 1.2\n", "RMS= 1.5\n", "MAX= 3.1\n", "================================================================================\n", "Computing NMR for conformer ensemble 'cladosporin_1b'\n", " nuclei expt_nmr pred_nmr error\n", "0 1 170.0 169.0 -1.0\n", "1 3 76.6 76.6 0.0\n", "2 4 32.6 33.7 1.1\n", "3 4a 141.6 143.8 2.2\n", "4 5 106.7 103.7 -3.0\n", "5 6 163.0 161.8 -1.2\n", "6 7 102.0 98.9 -3.1\n", "7 8 164.4 164.3 -0.1\n", "8 8a 101.6 100.0 -1.6\n", "9 9 37.3 37.5 0.2\n", "10 10 67.2 67.1 -0.1\n", "11 11 29.8 30.4 0.6\n", "12 12 18.1 20.1 2.0\n", "13 13 31.5 32.4 0.9\n", "14 14 67.4 67.5 0.1\n", "15 15 19.7 19.6 -0.1\n", "--------------------\n", "ConformerEnsemble(name='cladosporin_1b', formula='C16 H20 O5'), ensemble.n_conformers=113\n", "Highest Boltz weight: 15.161%\n", "Conformers (> 1% wt): 20\n", "MAE= 1.1\n", "RMS= 1.5\n", "MAX= 3.1\n", "================================================================================\n", "Computing NMR for conformer ensemble 'cladosporin_1c'\n", " nuclei expt_nmr pred_nmr error\n", "0 1 171.7 168.9 -2.8\n", "1 3 78.0 76.4 -1.6\n", "2 4 34.5 33.9 -0.6\n", "3 4a 143.6 143.7 0.1\n", "4 5 108.1 103.7 -4.4\n", "5 6 165.8 161.8 -4.0\n", "6 7 102.3 98.9 -3.4\n", "7 8 166.4 164.3 -2.1\n", "8 8a 101.8 100.0 -1.8\n", "9 9 42.2 41.3 -0.9\n", "10 10 75.2 73.4 -1.8\n", "11 11 32.5 32.0 -0.5\n", "12 12 22.6 25.3 2.7\n", "13 13 33.8 33.9 0.1\n", "14 14 75.5 73.8 -1.7\n", "15 15 24.7 22.2 -2.5\n", "--------------------\n", "ConformerEnsemble(name='cladosporin_1c', formula='C16 H20 O5'), ensemble.n_conformers=58\n", "Highest Boltz weight: 24.882%\n", "Conformers (> 1% wt): 15\n", "MAE= 1.9\n", "RMS= 2.3\n", "MAX= 4.4\n", "================================================================================\n", "Computing NMR for conformer ensemble 'cladosporin_1d'\n", " nuclei expt_nmr pred_nmr error\n", "0 1 171.7 168.9 -2.8\n", "1 3 77.6 75.3 -2.3\n", "2 4 34.6 35.1 0.5\n", "3 4a 143.7 143.9 0.2\n", "4 5 108.0 103.7 -4.3\n", "5 6 165.8 161.8 -4.0\n", "6 7 102.3 98.9 -3.4\n", "7 8 166.4 164.2 -2.2\n", "8 8a 101.7 99.9 -1.8\n", "9 9 43.0 42.5 -0.5\n", "10 10 74.8 72.2 -2.6\n", "11 11 33.1 32.7 -0.4\n", "12 12 22.5 25.4 2.9\n", "13 13 34.5 34.1 -0.4\n", "14 14 75.3 73.6 -1.7\n", "15 15 24.8 22.1 -2.7\n", "--------------------\n", "ConformerEnsemble(name='cladosporin_1d', formula='C16 H20 O5'), ensemble.n_conformers=79\n", "Highest Boltz weight: 49.811%\n", "Conformers (> 1% wt): 8\n", "MAE= 2.0\n", "RMS= 2.4\n", "MAX= 4.3\n", "================================================================================\n" ] } ], "source": [ "expt_nmr = pd.read_excel(\"cladosporin-expt.xlsx\")\n", "\n", "nuclei = [str(x) for x in expt_nmr[\"Unnamed: 0\"]]\n", "\n", "mol_data_map = {\n", " \"cladosporin_1a\": \"exp_1a\",\n", " \"cladosporin_1b\": \"exp_1b\",\n", " \"cladosporin_1c\": \"exp_1c\",\n", " \"cladosporin_1d\": \"exp_1d\",\n", "}\n", "\n", "# See htpps://doi.org/10.1021/jacs.0c12569 --> Supporting information\n", "# for more details about the scaling factor derivations\n", "SCALE_K = -0.9127\n", "SCALE_B = 170.87\n", "\n", "RT = 8.314 * 0.298\n", "\n", "\n", "library = ml.ConformerLibrary(\"cladosporin_g4_nmr.clib\")\n", "\n", "with library.reading():\n", " for mol_name, exp_name in mol_data_map.items():\n", " ensemble = library[mol_name]\n", " for a in ensemble.atoms:\n", " if a.element == ml.Element.C and a.label is None:\n", " a.label = \"15\"\n", "\n", " print(f\"Computing NMR for conformer ensemble {mol_name!r}\")\n", " energies = np.array(\n", " [d[\"ORCA/SCF_Energy\"] for d in ensemble.attrib[\"conformer_properties\"]]\n", " )\n", "\n", " rel_energies = 2625.5*(energies - np.min(energies))\n", "\n", " boltz_f = np.exp(rel_energies / (-RT))\n", " boltz_w = boltz_f / np.sum(boltz_f)\n", "\n", " shieldings_w = np.array(\n", " [\n", " np.average(a.attrib[\"NMR_shielding\"], weights=boltz_w)\n", " for a in ensemble.get_atoms(*nuclei)\n", " ]\n", " )\n", " nmr_w = np.round(shieldings_w * SCALE_K + SCALE_B, 1)\n", "\n", " err = nmr_w - expt_nmr[exp_name]\n", "\n", " result = pd.DataFrame(\n", " data={\n", " \"nuclei\": nuclei,\n", " \"expt_nmr\": expt_nmr[exp_name],\n", " \"pred_nmr\": nmr_w,\n", " \"error\": err,\n", " }\n", " )\n", "\n", " print(result)\n", " print(\"-\" * 20)\n", " print(f\"{ensemble!r}, {ensemble.n_conformers=}\")\n", " print(f\"Highest Boltz weight: {np.max(boltz_w):0.3%}\")\n", " print(f\"Conformers (> 1% wt): {np.count_nonzero(boltz_w > 0.01)}\")\n", " print(f\"MAE= {np.average(np.abs(err)):5.1f}\")\n", " print(f\"RMS= {np.sqrt(np.average(err**2)):5.1f}\")\n", " print(f\"MAX= {np.max(np.abs(err)):5.1f}\")\n", " print(\"=\" * 80)" ] } ], "metadata": { "kernelspec": { "display_name": "dev-blake", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }