{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example 2: Recompile Crude GeomSet\n", "\n", "## Objectives\n", "\n", "In this notebook, we show a workflow that compiles the data from one published dataset.\n", "\n", "Key references:\n", "1. [Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).](https://doi.org/10.1038/s41597-022-01288-4)\n", "2. [Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, \"GEOM\", Harvard Dataverse, V4; molecule_net.tar.gz [fileName]](https://doi.org/10.7910/DVN/JNGTDF)\n", "\n", "## Prerequisites\n", "\n", "- `pandas`\n", "- `py3Dmol`\n", "\n", "No files other than this notebook are required.\n", "However, you are welcome to download the `drugs_crude.msgpack.tar.gz` archive from the server manually, thereby bypassing the download step below.\n", "\n", "## Hardware Specification for Rerun\n", "\n", "Desktop workstation with 2x AMD EPYC 7702 64-Core processors (128 physical and 256 logical cores in total) and 1024 GB of DDR4 memory, running Ubuntu 22.04 LTS." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Imports required to execute this notebook\n", "import molli as ml\n", "try:\n", "    import ujson as json\n", "except ImportError:\n", "    import json\n", "import msgpack\n", "import numpy as np\n", "from pathlib import Path\n", "from tqdm.notebook import tqdm\n", "import tarfile\n", "ml.visual.configure(bgcolor=\"white\")\n", "\n", "# This is to suppress warnings\n", "from openbabel import pybel\n", "pybel.ob.obErrorLog.SetOutputLevel(0)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Step 1. Download the data archive\n", "`drugs_crude.msgpack.tar.gz`" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Definitions of key paths\n", "drugs_crude_targz = Path(\"drugs_crude.msgpack.tar.gz\")\n", "drugs_crude_mpack = Path(\"drugs_crude.msgpack\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Download the required data archive. This is done *manually* in this notebook to make sure the workflow is reproducible on both Windows and Linux." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "if not drugs_crude_targz.is_file():\n", "    import requests\n", "    with requests.get(\"https://dataverse.harvard.edu/api/access/datafile/4360331\", stream=True) as rq:\n", "        rq.raise_for_status()\n", "        with open(drugs_crude_targz, \"wb\") as f:\n", "            for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks\n", "                f.write(chunk)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "if not drugs_crude_mpack.is_file():\n", "    with tarfile.open(drugs_crude_targz, \"r:gz\") as tf:\n", "        tf.extractall()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Convert the data to molli `.clib` format\n", "\n", "Now that we have the raw data, we will reimport it in molli format. The advantages of this storage technique are:\n", "1. Lightweight file format (the reinterpreted data has the same disk footprint as the compressed `.tar.gz` archive)\n", "2. Molecular properties are stored *within* the molecule objects in the `ensemble.attrib` attribute of the `ConformerEnsemble` instance."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e710ffa8066942a28d8672e8fdc92c1f", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/292000 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# if not Path(\"drugs_crude.clib\").is_file():\n", "out_library = ml.ConformerLibrary(\"drugs_crude.clib\", overwrite=False, readonly=False)\n", "with (\n", "    open(\"drugs_crude.msgpack\", \"rb\") as f,\n", "    open(\"names.txt\", \"wt\") as names_out,\n", "    # `desc` must be passed by keyword: the first positional argument of tqdm is the iterable\n", "    tqdm(desc=\"Recollecting molecules\", total=292_000) as pb,\n", "    out_library.writing(),\n", "):\n", "    ensemble_idx = 0\n", "    for mol_1000_dict in msgpack.Unpacker(f):\n", "        geom_entry: dict[str, object]\n", "        for smi, geom_entry in mol_1000_dict.items():\n", "            # So as not to recollect items that we already processed\n", "            ensemble_idx += 1\n", "            if format(ensemble_idx, \"x\") in out_library.keys():\n", "                # pb.write(f\"found {smi}\")\n", "                pb.update(1)\n", "                continue\n", "\n", "            conformers = geom_entry.pop(\"conformers\")\n", "\n", "            coords = []\n", "            weights = []\n", "            atoms = None\n", "            for conf in conformers:\n", "                axyz = np.array(conf[\"xyz\"])\n", "                xyz = np.asarray(axyz[:, 1:])\n", "                ats = np.asarray(axyz[:, 0], dtype=int)\n", "                if atoms is None:\n", "                    atoms = ats\n", "                else:\n", "                    assert np.allclose(atoms, ats)\n", "                coords.append(xyz)\n", "                weights.append(conf[\"boltzmannweight\"])\n", "\n", "            # Number of unique conformers\n", "            n_confs = geom_entry[\"uniqueconfs\"]\n", "            charge = geom_entry.pop(\"charge\", 0)\n", "\n", "            name = format(ensemble_idx, \"x\")\n", "\n", "            names_out.write(f\"{name:>10} {smi}\\n\")\n", "\n", "            ensemble = ml.ConformerEnsemble(\n", "                atoms.tolist(),\n", "                n_atoms=len(atoms),\n", "                n_conformers=len(coords),\n", "                coords=coords,\n", "                name=name,\n", "            )\n", "\n", "            ensemble.attrib |= geom_entry\n", "            ensemble.attrib[\"smiles\"] = smi\n", "\n", "            out_library[name] = ensemble\n", "\n", "            pb.update(1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3. Enjoy the concise syntax for working with the molecule objects\n", "\n", "The command below reports statistics on the number of conformers per ensemble in the `.clib` file." ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 292035.000000\n", "mean 106.918787\n", "std 166.366268\n", "min 1.000000\n", "25% 17.000000\n", "50% 52.000000\n", "75% 131.000000\n", "max 7461.000000\n", "dtype: float64\n" ] } ], "source": [ "!molli stats \"m.n_conformers\" drugs_crude.clib" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/3dmoljs_load.v0": "
You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
\n",
" jupyter labextension install jupyterlab_3dmol