{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example 1: Recompile MolNet\n", "\n", "## Objectives\n", "\n", "In this notebook we show the workflow that compiles the data from one published dataset.\n", "\n", "Key references:\n", "\n", "1. [Axelrod, S., Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Sci Data 9, 185 (2022).](https://doi.org/10.1038/s41597-022-01288-4)\n", "2. [Axelrod, Simon; Gomez-Bombarelli, Rafael, 2021, \"GEOM\", Harvard Dataverse, V4; molecule_net.tar.gz [fileName]](https://doi.org/10.7910/DVN/JNGTDF)\n", "\n", "## Prerequisites\n", "\n", "- `pandas`\n", "- `py3Dmol`\n", "\n", "No files other than this notebook are required.\n", "However, you are welcome to download the `molecule_net.tar.gz` file from the server manually, thereby bypassing the download step below.\n", "\n", "## Hardware Specification for Rerun\n", "\n", "Desktop workstation with 2x AMD EPYC 7702 64-core processors (128 physical and 256 logical cores in total) and 1024 GB DDR4 RAM, running Ubuntu 22.04 LTS." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Imports required to execute this notebook\n", "import molli as ml\n", "# Prefer ujson when it is installed; otherwise fall back to the standard library\n", "try:\n", "    import ujson as json\n", "except ImportError:\n", "    import json\n", "import pickle\n", "from pathlib import Path\n", "from tqdm.notebook import tqdm\n", "import tarfile\n", "ml.visual.configure(bgcolor=\"white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Step 1. Download the `molecule_net.tar.gz` archive" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Definitions of key paths\n", "molnet_targz = Path(\"molecule_net.tar.gz\")\n", "molnet_root = Path(\"molecule_net\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the required molecule_net dataset. 
This is done *manually* in this notebook so that the workflow is reproducible on both Windows and Linux." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Skip the download if the archive is already present\n", "if not molnet_targz.is_file():\n", "    import requests\n", "    with requests.get(\"https://dataverse.harvard.edu/api/access/datafile/5858506\", stream=True) as rq:\n", "        rq.raise_for_status()\n", "        with open(molnet_targz, \"wb\") as f:\n", "            for chunk in rq.iter_content(128*1024*1024): # iterate over data in 128 MiB chunks\n", "                f.write(chunk)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Extract the archive only if the target directory does not exist yet\n", "if not molnet_root.is_dir():\n", "    with tarfile.open(molnet_targz, \"r:gz\") as tf:\n", "        tf.extractall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2. Convert the data to molli `.clib` format\n", "\n", "Now that we have the raw data, we will re-import it in the molli format. The advantages of this storage technique are:\n", "\n", "1. Lightweight file format (the reinterpreted data has the same disk footprint as the compressed `.tar.gz` archive).\n", "2. Molecular properties are stored *within* the molecule objects, in the `ensemble.attrib` attribute of the `ConformerEnsemble` instance." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cdbaf72a6255419cba1ad1dfa3e5235b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Importing molecule_net rdkit molecular data: 0%| | 0/16865 [00:00\n

\n \n", "text/html": [ "
\n", "

\n", "

\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/3dmoljs_load.v0": "", "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This jupyter magic will show a given conformer ensemble\n", "%clib_view molnet.clib AAAQFGUYHFJNHI-VGUBEVBKNA-N" ] } ], "metadata": { "kernelspec": { "display_name": "dev-blake", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }