Example 2: Recompile Crude GeomSet¶
Objectives¶
In this notebook we show the workflow that compiles the data from one published dataset Key references
Prerequisites¶
pandaspy3Dmol
No additional files, besides this notebook, will be required. However, if you would like to manually download the molecule_net.tar.gz file from the server, therefore bypassing one of the steps here, you are welcome to do so.
Hardware Specification for Rerun¶
Desktop workstation with 2x (AMD EPYC 7702 64-Core) with total of 128 physical and 256 logical cores, 1024 GB DDR4 with Ubuntu 22.04 LTS operating system.
[5]:
# Imports required to execute this notebook
import molli as ml
try:
import ujson as json
except:
import json
import msgpack
import numpy as np
from pathlib import Path
from tqdm.notebook import tqdm
from pathlib import Path
import tarfile
ml.visual.configure(bgcolor="white")
# This is to suppress warnings
from openbabel import pybel
pybel.ob.obErrorLog.SetOutputLevel(0)
Step 1. Download the data archive¶
drugs_crude.msgpack.tar.gz
[6]:
# Definitions of key paths
drugs_crude_targz = Path("drugs_crude.msgpack.tar.gz")
drugs_crude_mpack = Path("drugs_crude.msgpack")
Download the required molecule_net dataset. This is done manually in this notebook to make sure the workflow would be reproducible on both Windows ans Linux
[7]:
if not drugs_crude_targz.is_file():
import requests
with requests.get("https://dataverse.harvard.edu/api/access/datafile/4360331", stream=True) as rq:
rq.raise_for_status()
with open(drugs_crude_targz, "wb") as f:
for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks
f.write(chunk)
[8]:
if not drugs_crude_mpack.is_file():
with tarfile.open(drugs_crude_targz, "r:gz") as tf:
tf.extractall()
Step 2. Convert the data to molli .clib format¶
Now that we have the raw data, we will reimport it in molli format. The advantages of such storage technique are:
Lightweight file format (the reinterpreted data has the same disk footprint as the compressed
.tar.gzarchive)Molecular properties are stored within the molecule objects in the
ensemble.attribattribute of theConformerEnsembleinstance.
[9]:
# if not Path("drugs_crude.clib").is_file():
out_library = ml.ConformerLibrary("drugs_crude.clib", overwrite=False, readonly=False)
with (
open("drugs_crude.msgpack", "rb") as f,
open("names.txt", "wt") as names_out,
tqdm("Recollecting molecules", total=292_000) as pb,
out_library.writing(),
):
ensemble_idx = 0
for mol_1000_dict in msgpack.Unpacker(f):
geom_entry: dict[str, object]
for smi, geom_entry in mol_1000_dict.items():
# So as not to recollect items that we already processed
ensemble_idx += 1
if format(ensemble_idx, "x") in out_library.keys():
# pb.write(f"found {smi}")
pb.update(1)
continue
conformers = geom_entry.pop("conformers")
coords = []
weights = []
atoms = None
for conf in conformers:
axyz = np.array(conf["xyz"])
xyz = np.asarray(axyz[:, 1:])
ats = np.asarray(axyz[:, 0], dtype=int)
if atoms is None:
atoms = ats
else:
assert np.allclose(atoms, ats)
coords.append(xyz)
weights.append(conf["boltzmannweight"])
# Number of unique conformers
n_confs = geom_entry["uniqueconfs"]
charge = geom_entry.pop("charge", 0)
name = format(ensemble_idx, "x")
names_out.write(f"{name:>10} {smi}\n")
ensemble = ml.ConformerEnsemble(
atoms.tolist(),
n_atoms=len(atoms),
n_conformers=len(coords),
coords=coords,
name=name,
)
ensemble.attrib |= geom_entry
ensemble.attrib["smiles"] = smi
out_library[name] = ensemble
pb.update(1)
Step 3. Enjoy the concise syntax for operating with the molecule objects¶
This gives the statistics for the number of conformers in the clib file.
[10]:
!molli stats "m.n_conformers" drugs_crude.clib
count 292035.000000
mean 106.918787
std 166.366268
min 1.000000
25% 17.000000
50% 52.000000
75% 131.000000
max 7461.000000
dtype: float64
[11]:
# This jupyter magic will show a given conformer ensemble
# At this point we may see that the
%clib_view drugs_crude.clib 47393
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol
Step 4. Reimport bonds¶
The previous conformer ensembles feature an almost complete structure. The missing component is the bonding table.
We will be using the simplest way to do so: by using OpenBabel.
[12]:
source = ml.ConformerLibrary("drugs_crude.clib", readonly=True)
destn = ml.ConformerLibrary("drugs_crude_bonded.clib", overwrite=False, readonly=False)
with source.reading(), destn.writing():
for k in (pb := tqdm(source)):
ens = source[k]
# This loads an XYZ files with openbabel parser which does a good job at guessing the bonding.
try:
mol: ml.Molecule = ml.loads(ens[0].dumps_xyz(), "xyz", parser="openbabel")
ens.connect_like(mol)
except:
pb.write(f"Failed: {k}")
else:
destn[k] = ens
Failed: 1fef8
Failed: 47393
Failed: 3e842
Failed: 37c99
Failed: 3b947
Failed: 467b1
Failed: 19581
[13]:
# This jupyter magic will show a given conformer ensemble
# This time the bonding information should be accounted for.
%clib_view drugs_crude_bonded.clib 12af
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol