{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example 2: Recompile Crude GeomSet\n", "\n", "## Objectives\n", "\n", "This notebook shows a workflow that recompiles the data from one published dataset (GEOM) into a molli conformer library.\n", "\n", "Key references:\n", "\n", "1. [Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).](https://doi.org/10.1038/s41597-022-01288-4)\n", "2. [Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, \"GEOM\", Harvard Dataverse, V4; molecule_net.tar.gz [fileName]](https://doi.org/10.7910/DVN/JNGTDF)\n", "\n", "## Prerequisites\n", "\n", "- `pandas`\n", "- `py3Dmol`\n", "\n", "No additional files besides this notebook are required.\n", "However, if you would like to download the `drugs_crude.msgpack.tar.gz` archive from the server manually, and thereby bypass the download step below, you are welcome to do so.\n", "\n", "## Hardware Specification for Rerun\n", "\n", "Desktop workstation with 2x AMD EPYC 7702 64-core processors (128 physical and 256 logical cores in total) and 1024 GB of DDR4 memory, running Ubuntu 22.04 LTS." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Imports required to execute this notebook\n", "import molli as ml\n", "\n", "# Prefer the faster ujson parser when it is installed\n", "try:\n", "    import ujson as json\n", "except ImportError:\n", "    import json\n", "import msgpack\n", "import numpy as np\n", "from pathlib import Path\n", "from tqdm.notebook import tqdm\n", "import tarfile\n", "\n", "ml.visual.configure(bgcolor=\"white\")\n", "\n", "# This is to suppress OpenBabel warnings\n", "from openbabel import pybel\n", "pybel.ob.obErrorLog.SetOutputLevel(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Step 1. 
Download the data archive\n", "`drugs_crude.msgpack.tar.gz`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Definitions of key paths\n", "drugs_crude_targz = Path(\"drugs_crude.msgpack.tar.gz\")\n", "drugs_crude_mpack = Path(\"drugs_crude.msgpack\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the required GEOM data archive. This is done *manually* in this notebook to make sure the workflow is reproducible on both Windows and Linux." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Skip the download if the archive is already present\n", "if not drugs_crude_targz.is_file():\n", "    import requests\n", "\n", "    # Stream the response so the archive is never held in memory all at once\n", "    with requests.get(\"https://dataverse.harvard.edu/api/access/datafile/4360331\", stream=True) as rq:\n", "        rq.raise_for_status()\n", "        with open(drugs_crude_targz, \"wb\") as f:\n", "            for chunk in rq.iter_content(128*1024*1024):  # iterate over data in 128 MiB chunks\n", "                f.write(chunk)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Unpack the archive only if the extracted file is not already present\n", "if not drugs_crude_mpack.is_file():\n", "    with tarfile.open(drugs_crude_targz, \"r:gz\") as tf:\n", "        tf.extractall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Convert the data to molli `.clib` format\n", "\n", "Now that we have the raw data, we will reimport it in the molli format. This storage format has two advantages:\n", "\n", "1. It is lightweight: the reinterpreted data has the same disk footprint as the compressed `.tar.gz` archive.\n", "2. Molecular properties are stored *within* the molecule objects, in the `ensemble.attrib` attribute of the `ConformerEnsemble` instance."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e710ffa8066942a28d8672e8fdc92c1f", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/292000 [00:0010} {smi}\\n\")\n", "\n", " ensemble = ml.ConformerEnsemble(\n", " atoms.tolist(),\n", " n_atoms=len(atoms),\n", " n_conformers=len(coords),\n", " coords=coords,\n", " name=name,\n", " )\n", "\n", " ensemble.attrib |= geom_entry\n", " ensemble.attrib[\"smiles\"] = smi\n", "\n", " out_library[name] = ensemble\n", "\n", " pb.update(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. Enjoy the concise syntax for operating with the molecule objects\n", "\n", "This gives the statistics for the number of conformers in the clib file." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 292035.000000\n", "mean 106.918787\n", "std 166.366268\n", "min 1.000000\n", "25% 17.000000\n", "50% 52.000000\n", "75% 131.000000\n", "max 7461.000000\n", "dtype: float64\n" ] } ], "source": [ "!molli stats \"m.n_conformers\" drugs_crude.clib" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
\n

\n
\n", "text/html": [ "
\n", "

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This Jupyter magic displays the given conformer ensemble.\n", "# At this point we may see that the bonding information is still missing.\n", "%clib_view drugs_crude.clib 47393" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4. Reimport bonds\n", "\n", "The conformer ensembles created above feature an *almost* complete structure. The missing component is the bonding table.\n", "\n", "We will restore it in the simplest way: by using OpenBabel." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "470af79ef62e4b73bb2f92ced906f528", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/292035 [00:00\n

\n \n", "text/html": [ "
\n", "

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This jupyter magic will show a given conformer ensemble\n", "# This time the bonding information should be accounted for.\n", "%clib_view drugs_crude_bonded.clib 12af " ] } ], "metadata": { "kernelspec": { "display_name": "dev-blake", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }