{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example 1: Recompile MolNet\n", "\n", "## Objectives\n", "\n", "In this notebook we show the workflow that compiles the data from one published dataset\n", "Key references\n", "1. [Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022). ](https://doi.org/10.1038/s41597-022-01288-4)\n", "2. [ Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, \"GEOM\", , Harvard Dataverse, V4; molecule_net.tar.gz [fileName] ](https://doi.org/10.7910/DVN/JNGTDF)\n", "\n", "## Prerequisites\n", "\n", "- `pandas`\n", "- `py3Dmol`\n", "\n", "No additional files, besides this notebook, will be required.\n", "However, if you would like to manually download the molecule_net.tar.gz file from the server, therefore bypassing one of the steps here, you are welcome to do so.\n", "\n", "## Hardware Specification for Rerun\n", "\n", "Desktop workstation with 2x (AMD EPYC 7702 64-Core) with total of 128 physical and 256 logical cores, 1024 GB DDR4 with Ubuntu 22.04 LTS operating system." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Imports required to execute this notebook\n", "import molli as ml\n", "try:\n", " import ujson as json\n", "except:\n", " import json\n", "import pickle\n", "from pathlib import Path\n", "from tqdm.notebook import tqdm\n", "from pathlib import Path\n", "import tarfile\n", "ml.visual.configure(bgcolor=\"white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Step 1. Download the `molecule_net.tar.gz` archive" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Definitions of key paths\n", "molnet_targz = Path(\"molecule_net.tar.gz\")\n", "molnet_root = Path(\"molecule_net\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the required molecule_net dataset. 
This is done *manually*, in Python code, to make sure the workflow is reproducible on both Windows and Linux." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Skip the download if the archive is already present\n", "if not molnet_targz.is_file():\n", "    import requests\n", "\n", "    with requests.get(\"https://dataverse.harvard.edu/api/access/datafile/5858506\", stream=True) as rq:\n", "        rq.raise_for_status()\n", "        with open(molnet_targz, \"wb\") as f:\n", "            for chunk in rq.iter_content(128*1024*1024):  # iterate over data in 128 MiB chunks\n", "                f.write(chunk)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Extract the archive only if the target directory does not already exist\n", "if not molnet_root.is_dir():\n", "    with tarfile.open(molnet_targz, \"r:gz\") as tf:\n", "        tf.extractall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Convert the data to molli `.clib` format\n", "\n", "Now that we have the raw data, we will reimport it in molli format. The advantages of this storage technique are:\n", "\n", "1. Lightweight file format (the reinterpreted data has the same disk footprint as the compressed `.tar.gz` archive).\n", "2. Molecular properties are stored *within* the molecule objects in the `ensemble.attrib` attribute of the `ConformerEnsemble` instance." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cdbaf72a6255419cba1ad1dfa3e5235b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Importing molecule_net rdkit molecular data: 0%| | 0/16865 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from molli.external.rdkit import from_rdmol\n", "library = ml.ConformerLibrary(\"molnet.clib\", overwrite=False, readonly=False)\n", "\n", "# summary.json maps each SMILES string to its serialized metadata (including the pickle path)\n", "with open(molnet_root / \"summary.json\", \"rt\") as f:\n", "    summary = json.load(f)\n", "\n", "with library.writing():\n", "    for i, (smi, entry) in tqdm(\n", "        enumerate(summary.items()),\n", "        total=len(summary),\n", "        desc=\"Importing molecule_net rdkit molecular data\",\n", "    ):\n", "        pkl_path = Path(entry[\"pickle_path\"])\n", "\n", "        # In lieu of better naming for the files, we opted to use the\n", "        # pickle file names. This is not strictly necessary, and the user may choose\n", "        # their own optimal naming scheme.\n", "        name = pkl_path.stem\n", "        if not name:\n", "            continue\n", "\n", "        # This step is a guard in case we are trying to import a file that already exists in the destination.\n", "        if name in library.keys():\n", "            continue\n", "\n", "        with open(molnet_root / pkl_path, \"rb\") as f:\n", "            pkl = pickle.load(f)\n", "\n", "        charge = pkl[\"charge\"]\n", "\n", "        # Each rdkit molecule conformer is now converted into a molli.chem.Molecule instance\n", "        conformers = [from_rdmol(c[\"rd_mol\"]) for c in pkl[\"conformers\"]]\n", "\n", "        weights = [c[\"boltzmannweight\"] for c in pkl[\"conformers\"]]\n", "\n", "        # Keep every remaining pickle field, merged with the summary entry, as ensemble attributes\n", "        pkl_attrib = {\n", "            k: v for k, v in pkl.items() if k not in {\"charge\", \"conformers\"}\n", "        } | entry\n", "\n", "        ensemble = ml.ConformerEnsemble(\n", "            conformers, name=name, charge=charge, weights=weights, attrib=pkl_attrib\n", "        )\n", "\n", "        # This step writes the ensemble into the library file.\n", "        library[name] = ensemble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. Enjoy the concise syntax for operating with the molecule objects\n", "\n", "The following command reports summary statistics for the number of conformers per ensemble in the `.clib` file." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 16865.000000\n", "mean 179.863979\n", "std 442.747245\n", "min 1.000000\n", "25% 8.000000\n", "50% 47.000000\n", "75% 173.000000\n", "max 7461.000000\n", "dtype: float64\n" ] } ], "source": [ "!molli stats \"m.n_conformers\" molnet.clib" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n jupyter labextension install jupyterlab_3dmol
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n",
" jupyter labextension install jupyterlab_3dmol