Example 2: Recompile Crude GeomSet¶

Objectives¶

In this notebook we show the workflow that compiles the data from one published dataset Key references

Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).
Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, “GEOM”, , Harvard Dataverse, V4; molecule_net.tar.gz [fileName]

Prerequisites¶

pandas
py3Dmol

No additional files, besides this notebook, will be required. However, if you would like to manually download the molecule_net.tar.gz file from the server, therefore bypassing one of the steps here, you are welcome to do so.

Hardware Specification for Rerun¶

Desktop workstation with 2x (AMD EPYC 7702 64-Core) with total of 128 physical and 256 logical cores, 1024 GB DDR4 with Ubuntu 22.04 LTS operating system.

[5]:

# Imports required to execute this notebook
import molli as ml
try:
    import ujson as json
except:
    import json
import msgpack
import numpy as np
from pathlib import Path
from tqdm.notebook import tqdm
from pathlib import Path
import tarfile
ml.visual.configure(bgcolor="white")

# This is to suppress warnings
from openbabel import pybel
pybel.ob.obErrorLog.SetOutputLevel(0)

Step 1. Download the data archive¶

drugs_crude.msgpack.tar.gz

[6]:

# Definitions of key paths
drugs_crude_targz = Path("drugs_crude.msgpack.tar.gz")
drugs_crude_mpack = Path("drugs_crude.msgpack")

Download the required molecule_net dataset. This is done manually in this notebook to make sure the workflow would be reproducible on both Windows ans Linux

[7]:

if not drugs_crude_targz.is_file():
    import requests
    with requests.get("https://dataverse.harvard.edu/api/access/datafile/4360331", stream=True) as rq:
        rq.raise_for_status()
        with open(drugs_crude_targz, "wb") as f:
            for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks
                f.write(chunk)

[8]:

if not drugs_crude_mpack.is_file():
    with tarfile.open(drugs_crude_targz, "r:gz") as tf:
        tf.extractall()

Step 2. Convert the data to molli `.clib` format¶

Now that we have the raw data, we will reimport it in molli format. The advantages of such storage technique are:

Lightweight file format (the reinterpreted data has the same disk footprint as the compressed .tar.gz archive)
Molecular properties are stored within the molecule objects in the ensemble.attrib attribute of the ConformerEnsemble instance.

[9]:

# if not Path("drugs_crude.clib").is_file():
out_library = ml.ConformerLibrary("drugs_crude.clib", overwrite=False, readonly=False)
with (
    open("drugs_crude.msgpack", "rb") as f,
    open("names.txt", "wt") as names_out,
    tqdm("Recollecting molecules", total=292_000) as pb,
    out_library.writing(),
):
    ensemble_idx = 0
    for mol_1000_dict in msgpack.Unpacker(f):
        geom_entry: dict[str, object]
        for smi, geom_entry in mol_1000_dict.items():
            # So as not to recollect items that we already processed
            ensemble_idx += 1
            if format(ensemble_idx, "x") in out_library.keys():
                # pb.write(f"found {smi}")
                pb.update(1)
                continue

            conformers = geom_entry.pop("conformers")

            coords = []
            weights = []
            atoms = None
            for conf in conformers:
                axyz = np.array(conf["xyz"])
                xyz = np.asarray(axyz[:, 1:])
                ats = np.asarray(axyz[:, 0], dtype=int)
                if atoms is None:
                    atoms = ats
                else:
                    assert np.allclose(atoms, ats)
                coords.append(xyz)
                weights.append(conf["boltzmannweight"])

            # Number of unique conformers
            n_confs = geom_entry["uniqueconfs"]
            charge = geom_entry.pop("charge", 0)

            name = format(ensemble_idx, "x")

            names_out.write(f"{name:>10}  {smi}\n")

            ensemble = ml.ConformerEnsemble(
                atoms.tolist(),
                n_atoms=len(atoms),
                n_conformers=len(coords),
                coords=coords,
                name=name,
            )

            ensemble.attrib |= geom_entry
            ensemble.attrib["smiles"] = smi

            out_library[name] = ensemble

            pb.update(1)

Step 3. Enjoy the concise syntax for operating with the molecule objects¶

This gives the statistics for the number of conformers in the clib file.

[10]:

!molli stats "m.n_conformers" drugs_crude.clib

count    292035.000000
mean        106.918787
std         166.366268
min           1.000000
25%          17.000000
50%          52.000000
75%         131.000000
max        7461.000000
dtype: float64

[11]:

# This jupyter magic will show a given conformer ensemble
# At this point we may see that the
%clib_view drugs_crude.clib 47393

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

Step 4. Reimport bonds¶

The previous conformer ensembles feature an almost complete structure. The missing component is the bonding table.

We will be using the simplest way to do so: by using OpenBabel.

[12]:

source = ml.ConformerLibrary("drugs_crude.clib", readonly=True)
destn = ml.ConformerLibrary("drugs_crude_bonded.clib", overwrite=False, readonly=False)

with source.reading(), destn.writing():
    for k in (pb := tqdm(source)):
        ens = source[k]
        # This loads an XYZ files with openbabel parser which does a good job at guessing the bonding.
        try:
            mol: ml.Molecule = ml.loads(ens[0].dumps_xyz(), "xyz", parser="openbabel")
            ens.connect_like(mol)
        except:
            pb.write(f"Failed: {k}")
        else:
            destn[k] = ens

Failed: 1fef8
Failed: 47393
Failed: 3e842
Failed: 37c99
Failed: 3b947
Failed: 467b1
Failed: 19581

[13]:

# This jupyter magic will show a given conformer ensemble
# This time the bonding information should be accounted for.
%clib_view drugs_crude_bonded.clib 12af