Parameterizing and automating Jupyter notebooks with papermill

Have you ever created a Jupyter notebook and wished you could generate the notebook with a different set of parameters? If so, you’ve probably done at least one of the following:

  1. Edited the variables in a cell and reran the notebook, saving off a copy as needed
  2. Saved a copy of the notebook and maybe hacked up code to edit the values directly in the .ipynb files and reran notebooks
  3. Built some custom code to set the variables with data loaded from a database or configuration file, then reran the notebook

It turns out there is a good solution to this problem, one that parameterizes interactive notebooks and coexists well with automated jobs. It’s called papermill.

Motivation

Many notebook authors use the standard practice of designating a cell near the top of their notebooks for global variables. The author or other users of the notebook then modify the values in the cell and run the entire notebook to obtain different results. To persist the output, the author will manually download the notebook in another format or save it as a different notebook file. But using only a notebook server and these manual methods can quickly become messy, difficult to track, and error-prone. Which notebook is the one you edit? Papermill helps solve this problem. In this article, I’ll introduce papermill and its basic usage, walk through an example of parameterization, and finally talk about ways to fully schedule and automate notebook execution using cron.

With papermill, a special cell in the notebook is designated for parameters. When papermill executes a parameterized notebook, either via the command line interface (CLI) or using the Python API, parameters are passed in and executed in a subsequent cell. This allows the notebook to be run multiple times with different parameters quickly. The resulting executed notebook can then be saved in a variety of places, including local or cloud storage.

Installation

To install papermill, use pip. I’d recommend installing it into a virtual environment created with virtualenv or conda. I often recommend pyenv for installing a recent Python version and creating a virtualenv, but use whatever you are most comfortable with.

pip install papermill

If you would like to use the various input and output options (like Amazon S3 or Microsoft Azure), you can install all the dependencies. I won’t get into the details here, but the documentation covers those options, and you can even extend papermill to add other handlers for input/output (I/O) of notebooks.

pip install papermill[all]

Basic use

The first thing most users will want to do with papermill is parameterize a notebook. I made a simple example notebook that you can download and follow along. Once you have Jupyter running and have opened a notebook, all you need to do is add a parameters tag to the cell with parameters in it.

How you add a tag in Jupyter notebook.

Save the notebook, and now you are ready to execute it using papermill. For the example notebook, use the CLI to run the notebook, supplying your own name.

papermill -p name Matt papermill_example1.ipynb papermill_matt.ipynb

This command is telling papermill to execute the input notebook papermill_example1.ipynb and write the output to papermill_matt.ipynb, while setting the parameter name to the value Matt. If you open the resulting notebook, the contents will now include a new cell after the parameters-tagged one with an injected-parameters tag like this.

The notebook after parameters are injected (with the new cell)
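
In case the screenshot doesn’t come through, here is a rough sketch of what those two cells amount to after papermill has run. The default "Joe" is what the example notebook reports for name; the greeting line is just a hypothetical stand-in for whatever the rest of the notebook does with the value.

```python
# Cell tagged "parameters": the default the notebook author wrote.
name = "Joe"

# Cell tagged "injected-parameters": added by papermill right after the
# parameters cell. It runs second, so its value overrides the default.
name = "Matt"

# Hypothetical downstream cell, standing in for the rest of the notebook.
greeting = f"Hello, {name}!"
print(greeting)
```

Because the injected cell runs after the tagged one, the defaults stay in the template and the notebook still works when you run it interactively without papermill.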

You should now see how you can add as many parameters as you need to make new notebooks from an existing notebook. Think of the main notebook (in our case, papermill_example1.ipynb) as a template that you can use to make as many copies as you want by quickly injecting parameters.

Basic API use

You may want to fetch or build your injected parameters using Python code, and so a Python API is also available to execute papermill. We can achieve the exact same result as above, in a Python script (or in a notebook, it works great there as well – and will show you the progress dynamically).

import papermill as pm

name = "Matt"
res = pm.execute_notebook(
    'papermill_example1.ipynb',
    f'papermill_{name}.ipynb',
    parameters = dict(name=name)
)
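
Since the API takes plain Python values, running the template for several people is just a loop. This is a minimal sketch: output_path_for and run_for_names are helper names I made up for illustration, and the list of names in the commented call is arbitrary.

```python
def output_path_for(name):
    """Build a per-person output filename from the parameter value."""
    return f"papermill_{name.lower()}.ipynb"


def run_for_names(names, template="papermill_example1.ipynb"):
    """Execute the template once per name and return the output paths."""
    import papermill as pm  # imported here so the helpers load without it

    outputs = []
    for name in names:
        out = output_path_for(name)
        pm.execute_notebook(template, out, parameters={"name": name})
        outputs.append(out)
    return outputs


# Example call (requires papermill and the example notebook):
# run_for_names(["Matt", "Ana", "Joe"])
```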

More parameter passing

So far we’ve passed only one parameter, using the -p option. You can pass parameters in several other ways.

Command Line

You can run all of these using the example notebook, then view the results yourself. First, you can specify multiple parameters from the CLI. Even if a parameter doesn’t exist in the notebook yet, it can be passed in and created. If the notebook has no parameters-tagged cell at all, papermill will create an injected-parameters cell and execute it at the top of the notebook.

Here’s an example.

papermill -p name Matt -p level 5 -p factor 0.33 -p alive True papermill_example1.ipynb papermill_matt.ipynb

or with long options instead…

papermill --parameters name Matt --parameters level 5 --parameters factor 0.33 --parameters alive True papermill_example1.ipynb papermill_matt.ipynb

Note that the -p or --parameters option will try to parse integers and floats, so if you want values to be interpreted as strings, use the -r or --raw option to pass them in unchanged.

papermill -r name Matt -r level 5 -r factor 0.33 -r alive True papermill_example1.ipynb papermill_matt.ipynb
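
To make the difference between -p and -r concrete, here is a rough approximation of the kind of conversion -p applies. This is my own sketch, not papermill’s code: guess_type is a made-up name, and the handling of True, False, and None is an assumption on my part, so check the papermill documentation if the exact rules matter to you.

```python
def guess_type(value):
    """Approximate how a -p value might be converted from its string form."""
    if value in ("True", "False"):  # assumed boolean handling
        return value == "True"
    if value == "None":  # assumed None handling
        return None
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value  # anything else stays a string


for raw in ["Matt", "5", "0.33", "True"]:
    parsed = guess_type(raw)
    print(f"-p {raw!r} gives {type(parsed).__name__}, -r keeps str")
```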

You can also use YAML for specifying parameters. This can be passed in via a file (-f or --parameters_file), a string (-y or --parameters_yaml), or a base64-encoded string (-b or --parameters_base64). This allows you to pass in more complex data, including lists and dictionaries.

papermill papermill_example1.ipynb papermill_matt.ipynb -y "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4"
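
Inside the notebook, those YAML values show up as ordinary Python objects. The assignments below are a sketch of roughly what the injected cell would contain for the YAML above; the last two lines are a hypothetical use of the values, not something from the example notebook.

```python
# Roughly what the injected-parameters cell would assign for the YAML above.
name = "Matt"
level = 5
factor = 0.33
alive = True
sizes = [1.0, 2.5, 3.7]      # YAML sequence becomes a Python list
params = {"x": 3, "y": 4}    # YAML mapping becomes a Python dict

# Hypothetical use: nested values behave like any other Python object.
total = sum(sizes)
product = params["x"] * params["y"]
print(total, product)
```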

You can also save the same YAML to a file, which you can then pass in directly or base64 encode. (Run this in your shell on Mac, Linux, or Windows WSL, in the directory where the notebook file is.)

echo  "
name: Matt
level: 5
factor: 0.33
alive: True
sizes:
    - 1.0
    - 2.5
    - 3.7
params:
    x: 3
    y: 4" > params.yaml

Now you can run the file version.

papermill papermill_example1.ipynb papermill_matt.ipynb -f params.yaml

Or the base64 version

PARAMS=$(cat params.yaml | base64) # makes the base64 version of the yaml file
# quote the variable so line-wrapped base64 output is passed as one argument
papermill papermill_example1.ipynb papermill_matt.ipynb -b "$PARAMS"
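
If you are building the parameters in Python anyway, the standard library can do the encoding. This is a sketch with a shortened, made-up YAML string; it only builds the argument list for the CLI call shown above and does not run it.

```python
import base64

# Shortened YAML for illustration.
yaml_text = """\
name: Matt
level: 5
sizes:
    - 1.0
    - 2.5
"""

# b64encode returns the encoded text as a single line, with no wrapping.
encoded = base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")

# Argument list for the papermill CLI, suitable for subprocess.run(command).
command = [
    "papermill",
    "papermill_example1.ipynb",
    "papermill_matt.ipynb",
    "-b",
    encoded,
]
print(" ".join(command))
```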

Either way, you should get the idea that you can pass complex data into your notebook from the command line, and also via the API. These examples all use the local filesystem for input and output of notebooks, but note that you can read and write notebooks from Amazon S3, Azure, Google Cloud Storage, or web servers.

Inspecting notebooks

You can also inspect the available parameters of a notebook from the CLI.

$ papermill --help-notebook papermill_example1.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example1.ipynb':
  name: Unknown type (default "Joe")

Or using the Python API.

pm.inspect_notebook('papermill_example1.ipynb')
{'name': {'name': 'name',
  'inferred_type_name': 'None',
  'default': '"Joe"',
  'help': ''}}
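
The result is a plain dictionary, so it is easy to post-process. Here is a small sketch that turns it into readable lines; summarize_parameters is a name I made up, and the sample data is copied from the output above so the snippet runs without papermill installed.

```python
def summarize_parameters(inspected):
    """Turn an inspect_notebook-style dict into 'name = default' lines."""
    lines = []
    for param_name, info in inspected.items():
        help_text = f"  # {info['help']}" if info.get("help") else ""
        lines.append(f"{param_name} = {info['default']}{help_text}")
    return lines


# Sample data copied from the output above; with papermill installed you
# could pass pm.inspect_notebook('papermill_example1.ipynb') instead.
sample = {
    "name": {
        "name": "name",
        "inferred_type_name": "None",
        "default": '"Joe"',
        "help": "",
    }
}
print("\n".join(summarize_parameters(sample)))
```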

Executing a full workflow

A typical workflow for papermill is to have a parameterized notebook, run it with multiple values, then convert the resulting notebooks into another format for review or reporting. Let’s walk through an example of how this might be set up.

First, we have a parameterized notebook that uses the Yahoo! Finance API to fetch stock prices and plot the data along with the all-time high price of the stock (or at least the high for the last two years, since I’m only fetching that much data at this point).

If you want to run this example, you will need to have the yfinance package installed, as well as matplotlib. You can install both with pip if needed.

We can use the papermill CLI to inspect the parameters.

$ papermill --help-notebook papermill_example2.ipynb
Usage: papermill [OPTIONS] NOTEBOOK_PATH [OUTPUT_PATH]

Parameters inferred for notebook 'papermill_example2.ipynb':
  symbol: Unknown type (default 'AAPL')

We’ll run this notebook with several symbols. I’ve chosen to use a shell script for this so that I can run it through a scheduled cron job. If desired, this could just as easily be done using a simple Python script. However, if you are using a virtual environment, you may end up needing a script anyway to ensure the virtualenv is loaded properly. In that case, it might just be easier to use a shell script for the entire process.

I’m also going to use the jupyter nbconvert command (you can also run it as jupyter-nbconvert) to convert the notebook into an HTML file for viewing in a web browser. Just like papermill, nbconvert is available via the command line or using the Python API.

The automation script

#!/bin/bash

set -eux

# activate our virtualenv (this was created using pyenv-virtualenv, yours will be elsewhere)
source /Users/mcw/.pyenv/versions/3.8.6/envs/pandas/bin/activate

# get to the script directory if running via cron
cd "$(dirname "${BASH_SOURCE[0]}")"

# execute the notebook once per symbol, then convert each result to html
for S in AAPL MSFT GOOG FB
do
        papermill -p symbol "$S" papermill_example2.ipynb "papermill_${S}.ipynb"
        jupyter-nbconvert --no-input --to html "papermill_${S}.ipynb"
done

You can run this script from your shell (after adjusting the line that activates the virtual environment to reflect your own setup). You can also schedule it to run regularly in cron pretty easily. For example, you can run this report every weekday at 4 PM like this (with your own path).

00 16 * * mon-fri /Users/mcw/projects/python_blogposts/tools/run_papermill.sh
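
As mentioned earlier, the same loop can be written as a Python script using the papermill and nbconvert APIs. The following is a minimal sketch under a few assumptions: paths_for and build_reports are helper names I made up, and setting exclude_input on the exporter only approximates what --no-input does on the command line. You would still need to run it with the right virtual environment active.

```python
from pathlib import Path

SYMBOLS = ["AAPL", "MSFT", "GOOG", "FB"]
TEMPLATE = "papermill_example2.ipynb"


def paths_for(symbol):
    """Return the (notebook, html) output paths for one symbol."""
    notebook = Path(f"papermill_{symbol}.ipynb")
    return notebook, notebook.with_suffix(".html")


def build_reports(symbols=SYMBOLS, template=TEMPLATE):
    """Execute the template per symbol and export each result to HTML."""
    # Imported lazily so the path helper works without these packages.
    import papermill as pm
    from nbconvert import HTMLExporter

    exporter = HTMLExporter()
    exporter.exclude_input = True  # hide code cells, similar to --no-input

    for symbol in symbols:
        notebook, html = paths_for(symbol)
        pm.execute_notebook(template, str(notebook), parameters={"symbol": symbol})
        body, _resources = exporter.from_filename(str(notebook))
        html.write_text(body, encoding="utf-8")


# build_reports()  # uncomment to run; needs papermill, nbconvert, yfinance
```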

Extending the example

With just a little more creativity (and some extra software configuration for nbconvert), you can output the notebooks to PDF or other formats, send them via email, or upload them to a server to have nice-looking reports updated on a daily basis.

Note that the per-symbol notebooks are saved to the local disk. They can be opened in Jupyter server and re-executed easily if debugging or further work is required. Just know that if you have an automated job running, the notebooks will be replaced each time it runs. Ideally, you want to work on your main template notebook, then generate new versions for each symbol with automation.

One other tip is that papermill can read and write to standard input and output. This means that if you have other tools that take notebook files as input, you don’t have to write the files out to disk. For example, in our shell script above, we could prevent writing out each individual notebook file per symbol and do the following inside our loop instead.

papermill -p symbol $S papermill_example2.ipynb | jupyter-nbconvert --stdin --no-input --to html --output report_${S}.html

Note that if you do this, you’ll need to open the main notebook (papermill_example2.ipynb) and edit your parameters to debug issues. But maybe that’s preferable if you need to save disk space and don’t need the ability to debug each notebook separately.

Summary

Papermill is a useful library to parameterize and execute Jupyter notebooks. You can use it to automate execution of your notebooks with any sets of parameters you can dream up. Follow this up with a conversion of the notebook using nbconvert to provide readable and useful versions of your notebooks.

There is much more that can be done with notebook automation, but starting with papermill as a tool to execute and parameterize notebooks is a good platform to build on.
