{ "cells": [ { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I recently wrote an ETL job to be run in a data processing pipeline. The job fetches data from four database tables in our data lake, loads them into pandas DataFrames, and outputs a single DataFrame. \n", "In pseudocode:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data X between {t_start} and {t_end}\")\n", "    return t_start, t_end\n", "\n", "def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data Y between {t_start} and {t_end}\")\n", "    return t_start, t_end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the fetching operations are the most time-consuming part of the pipeline, I \n", "wanted to cache the fetched results. I wrote a decorator:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What's wrong with this piece of code?" 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import pickle\n", "def cache_result(cache_dir='./tmp'):\n", "    def decorator(func):\n", "        def wrapped(t_start, t_end):\n", "            cache_dir = Path(cache_dir)\n", "            function_details = f\"{func.__code__.co_name},{t_start.isoformat()},{t_end.isoformat}\"\n", "            cache_filepath = cache_dir.joinpath(function_details)\n", "\n", "            try:\n", "                print(f\"Reading from cache: {cache_filepath}\")\n", "                with open(cache_filepath, 'rb') as f:\n", "                    res = pickle.load(f)\n", "            except:\n", "                res = func(t_start, t_end)\n", "                print(f\"Writing to cache: {cache_filepath}\")\n", "                with open(cache_filepath, 'wb') as f:\n", "                    pickle.dump(res, f)\n", "            return res\n", "        return wrapped\n", "    return decorator" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "@cache_result()\n", "def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data X between {t_start} and {t_end}\")\n", "    return t_start, t_end\n", "\n", "@cache_result()\n", "def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data Y between {t_start} and {t_end}\")\n", "    return t_start, t_end" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "ename": "UnboundLocalError", "evalue": "local variable 'cache_dir' referenced before assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mUnboundLocalError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfetch_data_x\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTimestamp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'2020-01-01'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTimestamp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'2020-02-02'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mwrapped\u001b[0;34m(t_start, t_end)\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mdecorator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mwrapped\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt_start\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt_end\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mcache_dir\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mPath\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcache_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0mfunction_details\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34mf\"{func.__code__.co_name},{t_start.isoformat()},{t_end.isoformat}\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0mcache_filepath\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcache_dir\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoinpath\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunction_details\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mUnboundLocalError\u001b[0m: local variable 'cache_dir' referenced before assignment" ] } ], "source": [ "fetch_data_x(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-02'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What is going on?\n", "This took me quite a while to figure out. 
I read through [a couple](https://wiki.python.org/moin/PythonDecorators) of [in-depth](https://www.python.org/dev/peps/pep-0318/) descriptions of how to write decorators, but was none the wiser. \n", "\n", "In the end, I figured out that this `UnboundLocalError` has less to do with decorators than with scoping. Python decides at compile time whether a name is local: if a function assigns to a name anywhere in its body, that name is local throughout the whole function. So we can *reference* a variable defined in an outer scope from an inner scope, but as soon as the inner function also assigns to that name (without declaring it `nonlocal` or `global`), every reference to it inside the function refers to the still-unassigned local. More details follow:\n", "\n", "## Referencing a variable defined in an outer scope: OK" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hello world'" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo = \"Hello \"\n", "\n", "def hello(x):\n", "    return foo+x\n", "hello(\"world\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning to a variable already defined in an outer scope: also fine. A new local variable is created; the variable in the outer scope is not modified." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Hi! world'" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "foo = \"Hello \"\n", "\n", "def hello(x):\n", "    foo = \"Hi! 
\"\n", " return foo+x\n", "hello(\"world\")\n", "\n", "hello(\"world\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "foo is: Hello \n" ] } ], "source": [ "print(f\"foo is: {foo}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## All hell breaks loose if you do *both*: reference `foo` and at the same time try to assign to it:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "ename": "UnboundLocalError", "evalue": "local variable 'foo' referenced before assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mUnboundLocalError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mfoo\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mhello\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"world\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mhello\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mhello\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0;32mif\u001b[0m \u001b[0mfoo\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"Hello\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0mfoo\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m\"Hi\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mfoo\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mUnboundLocalError\u001b[0m: local variable 'foo' referenced before assignment" ] } ], "source": [ "foo = \"Hello \"\n", "\n", "def hello(x):\n", "    if foo == \"Hello\":\n", "        foo = \"Hi\"\n", "    return foo+x\n", "\n", "hello(\"world\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Now the fixed decorator\n", "\n", "The inner function now assigns to a new name, `cache_dir_path`, so `cache_dir` is only read inside `wrapped` and resolves to the argument of `cache_result` through the closure. The fixed version also calls `t_end.isoformat()` (the first version was missing the parentheses), uses `_` as the separator in the file name, and creates the cache directory if it does not exist." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "def cache_result(cache_dir='./tmp'):\n", "    def decorator(func):\n", "        def wrapped(t_start, t_end):\n", "            # Assign to a new name so that cache_dir stays a closure variable\n", "            cache_dir_path = Path(cache_dir)\n", "            function_details = f\"{func.__code__.co_name}_{t_start.isoformat()}_{t_end.isoformat()}\"\n", "            cache_filepath = cache_dir_path.joinpath(function_details)\n", "            cache_dir_path.mkdir(parents=True, exist_ok=True)\n", "            try:\n", "                print(f\"Reading from cache: {cache_filepath}\")\n", "                with open(cache_filepath, 'rb') as f:\n", "                    res = pickle.load(f)\n", "            except Exception:  # cache miss or unreadable cache file\n", "                print(f\"Failed to read from cache: {cache_filepath}\")\n", "                res = func(t_start, t_end)\n", "                print(f\"Writing to cache: {cache_filepath}\")\n", "                with open(cache_filepath, 'wb') as f:\n", "                    pickle.dump(res, f)\n", "            return res\n", "        return wrapped\n", "    return decorator\n", "\n", "@cache_result()\n", "def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data X between {t_start} and {t_end}\")\n", "    return t_start, t_end\n", "\n", "@cache_result()\n", "def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):\n", "    print(f\"Fetching data Y between {t_start} and {t_end}\")\n", "    return t_start, t_end" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading from 
cache: tmp/fetch_data_x_2020-01-02T00:00:00_2020-02-02T00:00:00\n" ] }, { "data": { "text/plain": [ "(Timestamp('2020-01-02 00:00:00'), Timestamp('2020-02-02 00:00:00'))" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fetch_data_x(pd.Timestamp('2020-01-02'), pd.Timestamp('2020-02-02'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some useful articles on the topic:\n", "* https://stackoverflow.com/a/23558809\n", "* https://medium.com/@dannymcwaves/a-python-tutorial-to-understanding-scopes-and-closures-c6a3d3ba0937" ] } ], "metadata": { "kernelspec": { "display_name": "moia", "language": "python", "name": "python3" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "nav_menu": {}, "nikola": { "category": "", "date": "2020-03-30 23:13:50 UTC+02:00", "description": "", "link": "", "slug": "a-python-decorator-puzzle", "tags": "", "title": "A python decorator puzzle", "type": "text" }, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }