Example 1: Recompile MolNet

Objectives

In this notebook we show the workflow for compiling the data from one published dataset into the molli format.

Key references

  1. Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).

  2. Axelrod, S., Gómez-Bombarelli, R. “GEOM”, Harvard Dataverse, V4 (2021); molecule_net.tar.gz [fileName]

Prerequisites

  • pandas

  • py3Dmol

No additional files besides this notebook are required. If you prefer, you can download the molecule_net.tar.gz file from the server manually, which bypasses the download step in this notebook.

Hardware Specification for Rerun

Desktop workstation with 2x AMD EPYC 7702 64-Core processors (128 physical and 256 logical cores in total) and 1024 GB DDR4 RAM, running Ubuntu 22.04 LTS.

[12]:
# Imports required to execute this notebook
import molli as ml

# Prefer the faster ujson if it is installed; otherwise use the standard library
try:
    import ujson as json
except ImportError:
    import json
import pickle
import tarfile
from pathlib import Path
from tqdm.notebook import tqdm

ml.visual.configure(bgcolor="white")

Step 1. Download the molecule_net.tar.gz archive

[13]:
# Definitions of key paths
molnet_targz = Path("molecule_net.tar.gz")
molnet_root = Path("molecule_net")

Download the required molecule_net dataset. The download is done in Python within this notebook so that the workflow is reproducible on both Windows and Linux.

[14]:
if not molnet_targz.is_file():
    import requests
    with requests.get("https://dataverse.harvard.edu/api/access/datafile/5858506", stream=True) as rq:
        rq.raise_for_status()
        with open(molnet_targz, "wb") as f:
            for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks
                f.write(chunk)
[15]:
if not molnet_root.is_dir():
    with tarfile.open(molnet_targz, "r:gz") as tf:
        tf.extractall()
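
The extraction step above trusts the contents of the archive. As a minimal sketch of a more defensive variant, the hypothetical helper below (`extract_checked` is not part of molli or of the original notebook) refuses any member whose path would resolve outside the destination directory. It checks member names only, not symbolic links, so it is an illustration rather than a complete safeguard; on recent Python versions the built-in extraction filters of `tarfile` are the more thorough option.

```python
import tarfile
from pathlib import Path


def extract_checked(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into dest, rejecting paths that escape dest.

    Hypothetical helper for illustration; only member names are checked.
    """
    dest.mkdir(parents=True, exist_ok=True)
    dest = dest.resolve()
    with tarfile.open(archive, "r:gz") as tf:
        members = tf.getmembers()
        for member in members:
            # Resolve where this member would land and compare with dest
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tf.extractall(dest, members=members)
    return [m.name for m in members]
```

In this notebook it could be called as `extract_checked(molnet_targz, Path("."))`, which extracts into the current directory just as the cell above does.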

Step 2. Convert the data to molli .clib format

Now that we have the raw data, we will re-import it in the molli format. The advantages of this storage format are:

  1. Lightweight file format (the reinterpreted data has the same disk footprint as the compressed .tar.gz archive)

  2. Molecular properties are stored alongside the structures, in the attrib attribute of each ConformerEnsemble instance (ensemble.attrib).

[16]:
from molli.external.rdkit import from_rdmol
library = ml.ConformerLibrary("molnet.clib", overwrite=False, readonly=False)

# Contains SMILES and Serialized Information
with open(molnet_root / "summary.json", "rt") as f:
    summary = json.load(f)

with library.writing():
    for i, (smi, entry) in tqdm(
        enumerate(summary.items()),
        total=len(summary),
        desc="Importing molecule_net rdkit molecular data",
    ):
        pkl_path = Path(entry["pickle_path"])

        # In lieu of better naming for the files, we opted to use the
        # pickle file names. This is totally not necessary, and the user may choose
        # their own optimal naming scheme.
        name = pkl_path.stem
        if not name:
            continue

        # This step is a guard in case we are trying to import a file that already exists in the destination.
        if name in library.keys():
            continue

        with open(molnet_root / pkl_path, "rb") as f:
            pkl = pickle.load(f)

        charge = pkl["charge"]

        # Each rdkit molecule conformer is now converted into a molli.chem.Molecule instance
        conformers = [from_rdmol(c["rd_mol"]) for c in pkl["conformers"]]

        weights = [c["boltzmannweight"] for c in pkl["conformers"]]

        pkl_attrib = {
            k: v for k, v in pkl.items() if k not in {"charge", "conformers"}
        } | entry

        ensemble = ml.ConformerEnsemble(
            conformers, name=name, charge=charge, weights=weights, attrib=pkl_attrib
        )

        # This step writes the ensemble into the library file.
        library[name] = ensemble
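
The bookkeeping in the loop above (deriving the ensemble name, collecting the weights, and merging the attributes) can be tried without molli or the dataset. The sketch below uses made-up stand-ins for one `summary.json` entry and one unpickled record; the field names `uniqueconfs` and `smiles` and all values are hypothetical. It shows in particular that with the `|` operator (Python 3.9+) the right-hand dictionary wins when both sides share a key, so values from the summary entry override those from the pickle.

```python
from pathlib import Path

# Hypothetical stand-ins for one summary.json entry and its unpickled record
entry = {"pickle_path": "rd_mols/example_molecule.pickle", "uniqueconfs": 2}
pkl = {
    "charge": 0,
    "conformers": [{"boltzmannweight": 0.7}, {"boltzmannweight": 0.3}],
    "uniqueconfs": 99,  # deliberately conflicts with the summary entry
    "smiles": "CCO",
}

# The ensemble name is the pickle file name without its extension
name = Path(entry["pickle_path"]).stem

# One Boltzmann weight per conformer, in the same order as the conformers
weights = [c["boltzmannweight"] for c in pkl["conformers"]]

# Everything except charge and conformers becomes an attribute;
# on key collisions the right-hand operand (entry) takes precedence
attrib = {k: v for k, v in pkl.items() if k not in {"charge", "conformers"}} | entry

print(name)
print(weights)
print(attrib)
```

If the pickle values should take precedence instead, the operands of `|` can simply be swapped.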

Step 3. Enjoy the concise syntax for operating with the molecule objects

The following command reports summary statistics for the number of conformers per ensemble in the .clib file.

[17]:
!molli stats "m.n_conformers" molnet.clib
count    16865.000000
mean       179.863979
std        442.747245
min          1.000000
25%          8.000000
50%         47.000000
75%        173.000000
max       7461.000000
dtype: float64
[18]:
# This jupyter magic will show a given conformer ensemble
%clib_view molnet.clib AAAQFGUYHFJNHI-VGUBEVBKNA-N
