molli Study case: Cladosporin GIAO-DFT NMR prediction¶
Objective¶
The objective of this task is to use the molli workflow to transition from .CDXML molecular drawings to full-fledged 3D structures with subsequent CREST//GFN2-XTB conformer generation, ORCA equilibrium structure minimizations and GIAO-DFT NMR predictions. This workflow is particularly useful in validation of relative stereochemistry where experimental NMR is not conclusive alone (see e. g. J. Org. Chem. 2020, 85, 5, 3297)
Prerequisites¶
molliinstallationpandasinstallation (it is NOT a dependency of molli and must be installed separately)OpenBabelinstallation (it is NOT a dependency of molli and must be installed separately) Note: if you experience problems with openbabel refusing to optimize the structures, do not forget to set theBABEL_DATADIRenvironment variable to point to a valid location with openbabel data files. With conda installations, it seems to be consistently pointing to$CONDA_PREFIX/share/openbabelopenpyxlinstallation (it is NOT a dependency of molli and must be installed separately)rmsdinstallation (it is NOT a dependency of molli and must be installed separately)cladosporin.csvfile that has the experimental NMR chemical shiftscladosporin.cdxmlfile that contains the structural drawings
Methods¶
Conformer generation¶
Conformers are generated with iMTD-GC workflow as implemented in CREST.
[6]:
import molli as ml
import pandas as pd
import numpy as np
# This configures on-screen preview of molecules
ml.visual.configure(bgcolor="black", theme="dark")
Step 1. CDXML Parsing¶
In this step, all of the molecules are read from the CDXML file and converted into their 3D versions. This can be done from both the command line interface, as well as programmatically. Here, the command line interface will be shown.
[7]:
!molli parse cladosporin.cdxml -o cladosporin_g0.mlib --hadd
Parsing cladosporin.cdxml: 100%|█████████████████| 4/4 [00:00<00:00, 148.00it/s]
We will now use IPython line magic mlib_view to visualize the entries in the resulting library.
[8]:
%mlib_view cladosporin_g0.mlib cladosporin_1a
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol
Step 2. OpenBabel preliminary optimization¶
In this step we will perform a preliminary optimization of these molecules with OpenBabel’s implementation of MMFF94 force field. This typically removes parsing artifacts resulting in unreasonable geometries which may break GFN2-XTB optimizations from time to time.
Important: do inspect the resulting structures as they do not always converge to the necessary results.
[9]:
from molli.external.openbabel import obabel_optimize
source_lib = ml.MoleculeLibrary("cladosporin_g0.mlib")
obabel_preopt = ml.MoleculeLibrary("cladosporin_g1.mlib", overwrite=True, readonly=False)
# NOTE: in certain unreasonable starting geometries, such as linear C-O-H geometries, planar tetrahedral atoms and many more
# force field gradients corresponding to certain internal degrees of freedom seem to diverge
# this results in the inability of the corresponding forcefield to perform the optimization
# the only solution we were able to come up with to ameliorate this situation is to displace EVERY atom by a random vector of a certain magnitude
# These guards may look a bit clumsy
# but they become enable synchronization of parallel processes working on the same libraries
with source_lib.reading(), obabel_preopt.writing():
for k in source_lib.keys():
optimized = obabel_optimize(
mol=source_lib[k],
ff="mmff94",
max_steps=1000,
tol=1e-4,
coord_displace=0.1, # see note above
inplace=False
)
obabel_preopt[k] = optimized
optimized
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol
[9]:
Molecule(name='cladosporin_1d', formula='C16 H20 O5')
Step 3. GFN2-XTB Conformer generation¶
In this step the conformers of molecules in question will be generated. For this we will be utilizing molli.pipeline module. While it is still in development, it allows to simplify parallel calculations with intermediate result caching.
Note The code for this is in a separate file because jupyter notebook does not play nice with the thread pool executor.
[10]:
!python workflow.py
Running conformer generation with: crest 32
debug: nprocs 32
====================================================================================================
Submitting jobs: 0it [00:00, ?it/s]
Waiting for jobs: 0it [00:00, ?it/s]
Finalizing the calculations: 0it [00:00, ?it/s]
====================================================================================================
Submitting jobs: 0it [00:00, ?it/s]
Waiting for jobs: 0it [00:00, ?it/s]
Finalizing the calculations: 0it [00:00, ?it/s]
====================================================================================================
Starting a molli.pipeline.jobmap_sge calculation:
calculation: ORCADriver.optimize_ens
None
source: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g3.clib'), n_items=4)
destination: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4.clib'), n_items=0)
scratch dir: /home/blakeo2/.molli/scratch
Total number of jobs: 4
Exist in destination: 0
To be computed: 4
Submitting jobs: 0it [00:00, ?it/s]
Waiting for jobs: 0it [00:00, ?it/s]
Finalizing the calculations: 100%|████████████████| 4/4 [00:03<00:00, 1.04it/s]
====================================================================================================
Starting a molli.pipeline.jobmap_sge calculation:
calculation: ORCADriver.giao_nmr_ens
None
source: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4.clib'), n_items=4)
destination: ConformerLibrary(backend=UkvCollectionBackend('cladosporin_g4_nmr.clib'), n_items=0)
scratch dir: /home/blakeo2/.molli/scratch
Total number of jobs: 4
Exist in destination: 0
To be computed: 4
Submitting jobs: 0it [00:00, ?it/s]
Waiting for jobs: 0it [00:00, ?it/s]
Finalizing the calculations: 100%|████████████████| 4/4 [00:00<00:00, 7.07it/s]
[11]:
# This is not a necessary step but it does make our lives easier
!molli align -i cladosporin_g3.clib -o cladosporin_g3a.clib -q align_qry.mol2
The output path is cladosporin_g3a.clib
matching structures:
100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 82.68it/s]
bringing to an optimal rotation:
100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 22.36it/s]
Conformers checked: 340, maximal rmsd: 0.165, minimal rmsd: 0.042
Median rmsd: 0.095
[12]:
!molli align -i cladosporin_g4.clib -o cladosporin_g4a.clib -q align_qry.mol2
%clib_view cladosporin_g4a.clib cladosporin_1c
The output path is cladosporin_g4a.clib
matching structures:
100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 76.89it/s]
bringing to an optimal rotation:
100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 21.98it/s]
Conformers checked: 335, maximal rmsd: 0.088, minimal rmsd: 0.046
Median rmsd: 0.07
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol
Computing NMR¶
NMR chemical shifts will be computed with linear correction according to $ \delta_i = k \sigma_i + b $, where \(k = -0.9127, b = 170.87\) .
[13]:
expt_nmr = pd.read_excel("cladosporin-expt.xlsx")
nuclei = [str(x) for x in expt_nmr["Unnamed: 0"]]
mol_data_map = {
"cladosporin_1a": "exp_1a",
"cladosporin_1b": "exp_1b",
"cladosporin_1c": "exp_1c",
"cladosporin_1d": "exp_1d",
}
# See htpps://doi.org/10.1021/jacs.0c12569 --> Supporting information
# for more details about the scaling factor derivations
SCALE_K = -0.9127
SCALE_B = 170.87
RT = 8.314 * 0.298
library = ml.ConformerLibrary("cladosporin_g4_nmr.clib")
with library.reading():
for mol_name, exp_name in mol_data_map.items():
ensemble = library[mol_name]
for a in ensemble.atoms:
if a.element == ml.Element.C and a.label is None:
a.label = "15"
print(f"Computing NMR for conformer ensemble {mol_name!r}")
energies = np.array(
[d["ORCA/SCF_Energy"] for d in ensemble.attrib["conformer_properties"]]
)
rel_energies = 2625.5*(energies - np.min(energies))
boltz_f = np.exp(rel_energies / (-RT))
boltz_w = boltz_f / np.sum(boltz_f)
shieldings_w = np.array(
[
np.average(a.attrib["NMR_shielding"], weights=boltz_w)
for a in ensemble.get_atoms(*nuclei)
]
)
nmr_w = np.round(shieldings_w * SCALE_K + SCALE_B, 1)
err = nmr_w - expt_nmr[exp_name]
result = pd.DataFrame(
data={
"nuclei": nuclei,
"expt_nmr": expt_nmr[exp_name],
"pred_nmr": nmr_w,
"error": err,
}
)
print(result)
print("-" * 20)
print(f"{ensemble!r}, {ensemble.n_conformers=}")
print(f"Highest Boltz weight: {np.max(boltz_w):0.3%}")
print(f"Conformers (> 1% wt): {np.count_nonzero(boltz_w > 0.01)}")
print(f"MAE= {np.average(np.abs(err)):5.1f}")
print(f"RMS= {np.sqrt(np.average(err**2)):5.1f}")
print(f"MAX= {np.max(np.abs(err)):5.1f}")
print("=" * 80)
Computing NMR for conformer ensemble 'cladosporin_1a'
nuclei expt_nmr pred_nmr error
0 1 169.9 168.9 -1.0
1 3 76.3 75.6 -0.7
2 4 33.6 35.0 1.4
3 4a 141.8 143.9 2.1
4 5 106.7 103.6 -3.1
5 6 163.1 161.8 -1.3
6 7 102.0 98.9 -3.1
7 8 164.3 164.3 0.0
8 8a 101.5 100.0 -1.5
9 9 39.3 39.1 -0.2
10 10 66.6 65.9 -0.7
11 11 30.9 31.6 0.7
12 12 18.1 20.1 2.0
13 13 30.9 32.1 1.2
14 14 68.0 67.7 -0.3
15 15 18.9 19.1 0.2
--------------------
ConformerEnsemble(name='cladosporin_1a', formula='C16 H20 O5'), ensemble.n_conformers=85
Highest Boltz weight: 24.329%
Conformers (> 1% wt): 16
MAE= 1.2
RMS= 1.5
MAX= 3.1
================================================================================
Computing NMR for conformer ensemble 'cladosporin_1b'
nuclei expt_nmr pred_nmr error
0 1 170.0 169.0 -1.0
1 3 76.6 76.6 0.0
2 4 32.6 33.7 1.1
3 4a 141.6 143.8 2.2
4 5 106.7 103.7 -3.0
5 6 163.0 161.8 -1.2
6 7 102.0 98.9 -3.1
7 8 164.4 164.3 -0.1
8 8a 101.6 100.0 -1.6
9 9 37.3 37.5 0.2
10 10 67.2 67.1 -0.1
11 11 29.8 30.4 0.6
12 12 18.1 20.1 2.0
13 13 31.5 32.4 0.9
14 14 67.4 67.5 0.1
15 15 19.7 19.6 -0.1
--------------------
ConformerEnsemble(name='cladosporin_1b', formula='C16 H20 O5'), ensemble.n_conformers=113
Highest Boltz weight: 15.161%
Conformers (> 1% wt): 20
MAE= 1.1
RMS= 1.5
MAX= 3.1
================================================================================
Computing NMR for conformer ensemble 'cladosporin_1c'
nuclei expt_nmr pred_nmr error
0 1 171.7 168.9 -2.8
1 3 78.0 76.4 -1.6
2 4 34.5 33.9 -0.6
3 4a 143.6 143.7 0.1
4 5 108.1 103.7 -4.4
5 6 165.8 161.8 -4.0
6 7 102.3 98.9 -3.4
7 8 166.4 164.3 -2.1
8 8a 101.8 100.0 -1.8
9 9 42.2 41.3 -0.9
10 10 75.2 73.4 -1.8
11 11 32.5 32.0 -0.5
12 12 22.6 25.3 2.7
13 13 33.8 33.9 0.1
14 14 75.5 73.8 -1.7
15 15 24.7 22.2 -2.5
--------------------
ConformerEnsemble(name='cladosporin_1c', formula='C16 H20 O5'), ensemble.n_conformers=58
Highest Boltz weight: 24.882%
Conformers (> 1% wt): 15
MAE= 1.9
RMS= 2.3
MAX= 4.4
================================================================================
Computing NMR for conformer ensemble 'cladosporin_1d'
nuclei expt_nmr pred_nmr error
0 1 171.7 168.9 -2.8
1 3 77.6 75.3 -2.3
2 4 34.6 35.1 0.5
3 4a 143.7 143.9 0.2
4 5 108.0 103.7 -4.3
5 6 165.8 161.8 -4.0
6 7 102.3 98.9 -3.4
7 8 166.4 164.2 -2.2
8 8a 101.7 99.9 -1.8
9 9 43.0 42.5 -0.5
10 10 74.8 72.2 -2.6
11 11 33.1 32.7 -0.4
12 12 22.5 25.4 2.9
13 13 34.5 34.1 -0.4
14 14 75.3 73.6 -1.7
15 15 24.8 22.1 -2.7
--------------------
ConformerEnsemble(name='cladosporin_1d', formula='C16 H20 O5'), ensemble.n_conformers=79
Highest Boltz weight: 49.811%
Conformers (> 1% wt): 8
MAE= 2.0
RMS= 2.4
MAX= 4.3
================================================================================