A Python decorator puzzle

In [14]:
import pandas as pd

I recently wrote an ETL job to be run in a data processing pipeline. The job fetches data from four database tables in our data lake, loads each into a pandas DataFrame, and outputs a single combined DataFrame. In pseudocode, the fetch functions look like this:

In [44]:
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end
    
def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end

Since the fetching operations are the most time-consuming part of the pipeline, I wanted to cache the fetched results on disk, so I wrote the decorator below.

What's wrong with this piece of code?

In [45]:
from pathlib import Path
import pickle
def cache_result(cache_dir='./tmp'):
    def decorator(func):
        def wrapped(t_start, t_end):
            cache_dir = Path(cache_dir)
            function_details = f"{func.__code__.co_name},{t_start.isoformat()},{t_end.isoformat}"
            cache_filepath = cache_dir.joinpath(function_details)
            
            try:
                print(f"Reading from cache: {cache_filepath}")
                with open(cache_filepath, 'rb') as f:
                    res = pickle.load(f)
            except:
                res = func(t_start, t_end)
                print(f"Writing to cache: {cache_filepath}")
                with open(cache_filepath, 'wb') as f:
                    pickle.dump(res, f)
            return res
        return wrapped
    return decorator
In [46]:
@cache_result()
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end
    
@cache_result()
def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end
    
In [47]:
fetch_data_x(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-02'))
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-47-9ccb94c9170a> in <module>
----> 1 fetch_data_x(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-02'))

<ipython-input-45-d206a3d6af1a> in wrapped(t_start, t_end)
      4     def decorator(func):
      5         def wrapped(t_start, t_end):
----> 6             cache_dir = Path(cache_dir)
      7             function_details = f"{func.__code__.co_name},{t_start.isoformat()},{t_end.isoformat}"
      8             cache_filepath = cache_dir.joinpath(function_details)

UnboundLocalError: local variable 'cache_dir' referenced before assignment

What is going on?

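This took me quite a while to figure out. I read through a couple of in-depth descriptions of how to write decorators, but was none the wiser.

In the end, I figured out that this UnboundLocalError has less to do with decorators than with scoping. Inside a function we can read a variable defined in an outer scope, but as soon as the function assigns to that name anywhere in its body, Python treats the name as local for the entire function. In the decorator, cache_dir = Path(cache_dir) makes cache_dir local to wrapped, so the right-hand side reads a local variable before it has been assigned. More details follow: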

Referencing a variable defined in an outer scope: OK

In [51]:
foo = "Hello "

def hello(x):
    return foo+x
hello("world")
Out[51]:
'Hello world'

Assigning to a name that is already defined in an outer scope: also fine. A new local variable is created, and the variable in the outer scope is not modified.

In [56]:
foo = "Hello "

def hello(x):
    foo = "Hi! "
    return foo+x
hello("world")
Out[56]:
'Hi! world'
In [55]:
print(f"foo is: {foo}")
foo is: Hello 

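All hell breaks loose if you do both in the same function: read foo and also assign to it. Because of the assignment, Python treats foo as local throughout hello, so the comparison reads a local variable that has no value yet: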

In [57]:
foo = "Hello "

def hello(x):
    if foo == "Hello":
        foo = "Hi"
    return foo+x

hello("world")
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-57-5b276360a2fe> in <module>
      6     return foo+x
      7 
----> 8 hello("world")

<ipython-input-57-5b276360a2fe> in hello(x)
      2 
      3 def hello(x):
----> 4     if foo == "Hello":
      5         foo = "Hi"
      6     return foo+x

UnboundLocalError: local variable 'foo' referenced before assignment
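
The same behaviour can be reproduced with nested functions, which is closer to the decorator case. The sketch below is only an illustration: outer, reads_only and reads_and_assigns are made-up names, not part of the ETL job. It inspects the code objects to show that Python decides whether a name is local or comes from the enclosing scope when the function is compiled, not when the line runs. The exact wording of the error message depends on the Python version, so it is printed rather than hard-coded.

```python
def outer(cache_dir="./tmp"):
    def reads_only():
        # No assignment to cache_dir in this function, so the name is a
        # free variable resolved from the enclosing scope (a closure).
        return cache_dir

    def reads_and_assigns():
        # The assignment makes cache_dir local to this whole function,
        # so the right-hand side reads a local that has no value yet.
        cache_dir = cache_dir.upper()
        return cache_dir

    return reads_only, reads_and_assigns


reads_only, reads_and_assigns = outer()

# How the compiler classified the name in each function.
print(reads_only.__code__.co_freevars)         # names taken from the enclosing scope
print(reads_and_assigns.__code__.co_varnames)  # names local to the function

error_message = None
try:
    reads_and_assigns()
except UnboundLocalError as exc:
    error_message = str(exc)
    print(error_message)
```

Giving the local variable a different name, as in the fixed decorator, keeps cache_dir a free variable. Declaring it nonlocal would also silence the error, but then every call would overwrite the decorator argument, which is probably not what you want here.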

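Now the fixed decorator. The local variable gets its own name, cache_dir_path, so cache_dir inside wrapped still refers to the decorator argument. I also fixed the missing parentheses on t_end.isoformat() and made sure the cache directory exists before writing to it: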

In [58]:
def cache_result(cache_dir='./tmp'):
    def decorator(func):
        def wrapped(t_start, t_end):
            # Use a new local name: assigning to cache_dir here would make it
            # local to wrapped and hide the decorator argument.
            cache_dir_path = Path(cache_dir)
            function_details = f"{func.__code__.co_name}_{t_start.isoformat()}_{t_end.isoformat()}"
            cache_filepath = cache_dir_path.joinpath(function_details)
            cache_dir_path.mkdir(parents=True, exist_ok=True)
            try:
                print(f"Reading from cache: {cache_filepath}")
                with open(cache_filepath, 'rb') as f:
                    res = pickle.load(f)
            except (OSError, EOFError, pickle.UnpicklingError):
                # Cache file missing or unreadable: fetch and store the result.
                print(f"Failed to read from cache: {cache_filepath}")
                res = func(t_start, t_end)
                print(f"Writing to cache: {cache_filepath}")
                with open(cache_filepath, 'wb') as f:
                    pickle.dump(res, f)
            return res
        return wrapped
    return decorator

@cache_result()
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end
    
@cache_result()
def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end
    
In [49]:
fetch_data_x(pd.Timestamp('2020-01-02'), pd.Timestamp('2020-02-02'))
Reading from cache: tmp/fetch_data_x_2020-01-02T00:00:00_2020-02-02T00:00:00
Out[49]:
(Timestamp('2020-01-02 00:00:00'), Timestamp('2020-02-02 00:00:00'))
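
To check that the cache really saves a fetch, here is a minimal self-contained sketch under a few assumptions of mine: datetime stands in for pd.Timestamp (both provide isoformat()), the cache lives in a temporary directory instead of ./tmp, and the calls list is a hypothetical counter added only for this test. Two optional tweaks that are not in the decorator above: functools.wraps keeps the decorated function's name, and ':' is replaced in the file name because colons are not allowed in Windows file names.

```python
import pickle
import tempfile
from datetime import datetime
from functools import wraps
from pathlib import Path


def cache_result(cache_dir="./tmp"):
    def decorator(func):
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapped(t_start, t_end):
            cache_dir_path = Path(cache_dir)
            cache_dir_path.mkdir(parents=True, exist_ok=True)
            key = f"{func.__name__}_{t_start.isoformat()}_{t_end.isoformat()}"
            # Replace ':' so the file name is also valid on Windows.
            cache_filepath = cache_dir_path / key.replace(":", "-")
            if cache_filepath.exists():
                with open(cache_filepath, "rb") as f:
                    return pickle.load(f)
            res = func(t_start, t_end)
            with open(cache_filepath, "wb") as f:
                pickle.dump(res, f)
            return res
        return wrapped
    return decorator


calls = []  # records every real (non-cached) fetch

with tempfile.TemporaryDirectory() as tmp_dir:
    @cache_result(cache_dir=tmp_dir)
    def fetch_data_x(t_start, t_end):
        calls.append((t_start, t_end))
        return t_start, t_end

    first = fetch_data_x(datetime(2020, 1, 2), datetime(2020, 2, 2))
    second = fetch_data_x(datetime(2020, 1, 2), datetime(2020, 2, 2))

print(len(calls))       # number of real fetches
print(first == second)  # whether the cached result matches the fetched one
```

The second call finds the pickle file written by the first and returns its content without running the fetch function again.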