Profiling Python with cProfile, and a speedup tip

We would love for our Python programs to run as fast as possible, but figuring out how to speed things up requires gathering information about the current state of our code and knowing techniques to speed things up. First and foremost, we need to know where our program is spending its time, and what is relatively slow so we can attack that first. Once we know where the issues are, we can look at options for speeding things up using a technique that solves that problem.

In this post, we will look at a profiling tool and a technique for making it faster. There are a number of profiling tools available for Python, but the standard profiler distributed with Python is a good place to start.

cProfile (and profile, the pure Python profiler) are deterministic profilers. This means that they will tell you exactly how many times a line of code executes, and how much time it relatively takes to execute. There are profilers that take non-deterministic samples of running processes that can also be used to observe running processes and also build a profile by observing many samples over a longer time period. 

Note that deterministic profilers like cProfile can significantly slow down the execution of code, the value they provide is in determing where things are slow. For benchmarking purposes, other tools (like timeit) should be used. When combined with a profiler, it is possible to isolate slow code and verify speed improvements.

An example

To setup the use of cProfile, let’s look at a semi realistic example. Let’s say that we are maintaining a script that has to check installation fees for retail items from a database. For this example, I’ll build the database locally using sqlite, which means it will be pretty fast. However, in real life, this database may be on a remote database server, or might involve a more complex query, or may be a web service. As we move from accessing data in memory, to accessing it off disk, to accessing it over a local network, to accessing a server that is potentially multiple countries away, we can see speed differences of 10x, 100x, 1000x, or more, and it’s helpful to be thinking about these sorts of numbers in this context.

But let’s build a simple example so we can profile it.

Note that the following code was executed using Python 3.8.6.

import cProfile
import sqlite3
conn = sqlite3.connect('test.db')
def initialize_data():
c = conn.cursor()

c.execute('''DROP TABLE IF EXISTS installation_fee''')
c.execute('''CREATE TABLE installation_fee
             (item text, fee float)''')

c.execute("INSERT INTO installation_fee VALUES ('widget-1',100)")
c.execute("INSERT INTO installation_fee VALUES ('widget-2',0.50)")
c.execute("INSERT INTO installation_fee VALUES ('widget-9',0.40)")
c.execute("INSERT INTO installation_fee VALUES ('widget-99',0)")

conn.commit()

initialize_data()

def check_installation_fee(item):
    c = conn.cursor()

    c.execute("SELECT fee FROM installation_fee WHERE item = ?", (item,))

    res = c.fetchone()

    if res:
        return res[0]

Let’s say that we have a script that will have a large list of items and we need to work through the list, checking the fees. The naive implementation will check the database for each item.

import random
def fee_check():
     # sample of random widgets in our catalog
     orders = random.choices([i for i in range(100)], k=1000)
     return dict(zip(orders, [check_installation_fee(f'widget-{w}') for w in orders]))

Basic profiling

The quickest way to profile using cProfile is to use the run method. We’ll use cProfile over profile since it’s much faster. You supply runwith a Python statement to execute. In our case, we can just run our fee_check function. Note that this can be hard to read in a smaller browser, so I’d encourage you to consider also running this code on your own so you can see the formatting in context.

>>> cProfile.run("fee_check()")
         5009 function calls in 0.016 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.001    0.000    0.015    0.000 <ipython-input-2-743cd89b4a06>:1(check_installation_fee)
        1    0.000    0.000    0.016    0.016 <ipython-input-3-62685da51a62>:3(fee_check)
        1    0.000    0.000    0.000    0.000 <ipython-input-3-62685da51a62>:5(<listcomp>)
        1    0.001    0.001    0.016    0.016 <ipython-input-3-62685da51a62>:6(<listcomp>)
        1    0.000    0.000    0.016    0.016 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 random.py:386(choices)
        1    0.000    0.000    0.000    0.000 random.py:399(<listcomp>)
        1    0.000    0.000    0.016    0.016 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     1000    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1000    0.014    0.000    0.014    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
     1000    0.000    0.000    0.000    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     1000    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}

The run function will print a summary. Beneath the total number of function calls and total run time it shows a row for each line of code executed. For each row, it prints the number of calls (ncalls), the total time spent in the function itself (tottime), the time per function call (percall), the cumulative time in the function and all the functions it calls (cumtime), and finally the cumulative time spent per function call (the second percall). 

From these results, we see that the numbers are very small, but if you look at tottime you see that the execute method of the sqlite3.Cursorobject is where we spend the most of our time. This makes sense, it’s the point where our query has to be parsed, the database checked, and the result returned.

Other options for running cProfile

The run method can take an argument for changing the sorting, and you can also save the results of your profile run to a file that can be further analyzed. So we can sort on tottime using the string tottime or the SortKey.

>>> from pstats import SortKey
>>> cProfile.run("fee_check()", sort=SortKey.TIME)
         5009 function calls in 0.025 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.021    0.000    0.021    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.001    0.001    0.024    0.024 <ipython-input-3-62685da51a62>:6(<listcomp>)
     1000    0.001    0.000    0.023    0.000 <ipython-input-2-743cd89b4a06>:1(check_installation_fee)
     1000    0.001    0.000    0.001    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 random.py:399(<listcomp>)
     1000    0.000    0.000    0.000    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     1000    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.025    0.025 <ipython-input-3-62685da51a62>:3(fee_check)
        1    0.000    0.000    0.025    0.025 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 random.py:386(choices)
        1    0.000    0.000    0.000    0.000 <ipython-input-3-62685da51a62>:5(<listcomp>)
        1    0.000    0.000    0.025    0.025 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}

In cases where the profile is very large, or requires further analysis, it makes sense to persist it to a file for later analysis. The file can be loaded using pstats and sorted, filtered, and analyzed.

>>> import pstats
>>> cProfile.run("fee_check()", filename="fee_check.stats")
>>> stats = pstats.Stats("fee_check.stats")
>>> stats.sort_stats(SortKey.TIME).print_stats();
Thu Feb 11 18:41:51 2021    fee_check.stats

         5009 function calls in 0.021 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.018    0.000    0.018    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
        1    0.001    0.001    0.020    0.020 <ipython-input-3-62685da51a62>:6(<listcomp>)
     1000    0.001    0.000    0.019    0.000 <ipython-input-2-743cd89b4a06>:1(check_installation_fee)
     1000    0.001    0.000    0.001    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 /Users/mcw/.pyenv/versions/3.8.6/lib/python3.8/random.py:399(<listcomp>)
     1000    0.000    0.000    0.000    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     1000    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}
        1    0.000    0.000    0.021    0.021 <ipython-input-3-62685da51a62>:3(fee_check)
        1    0.000    0.000    0.021    0.021 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 /Users/mcw/.pyenv/versions/3.8.6/lib/python3.8/random.py:386(choices)
        1    0.000    0.000    0.000    0.000 <ipython-input-3-62685da51a62>:5(<listcomp>)
        1    0.000    0.000    0.021    0.021 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}

Easy profiling with context manager support

One more thing that is new for cProfile since Python 3.8 is context manager support. What this means is that you can make a local profiling context and quickly profile some local code. With an import, 2 lines of code (and an indent of course) you can quickly get an idea of what your code is up to. No more excuses to not profile!

>>> with cProfile.Profile() as pf:
        for i in range(10):
            check_installation_fee(f'widget-{i}')
>>> pf.print_stats()
         42 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    0.000    0.000    0.001    0.000 <ipython-input-2-743cd89b4a06>:1(check_installation_fee)
        1    0.000    0.000    0.000    0.000 cProfile.py:133(__exit__)
       10    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       10    0.001    0.000    0.001    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
       10    0.000    0.000    0.000    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}

The speedup

Now that we know where most of our code is spending its time, if we want to make this code faster, what should we do? I’m sure there’s a good chance you would say we should just cache the values from the installation fee check in our code. If we are hitting the database more than once for the same value (and it doesn’t change often or won’t change during the processing of our code), caching makes a lot of sense. You could do this any number of ways, including adding a cache into the check_installation_fee method with a simple dictionary that stores the results when found and checks for them before hitting the database in future invocations. You could also add this code where you call this function.

But to keep things as simple as possible, let’s just use the caching functionality available in the functools module. If you’re using Python 3.2 or above, the lru_cache decorator is available, and you can add caching with one line of code. This is a simple “Least Recently Used” cache, that will store the results of the function call locally in the function (keyed by the arguments to the function) and return those values instead of invoking the function in subsequent calls if they are in the cache. You have the option to set the cache size and tune this a little bit, check the docs for details.

>>> import functools
>>> @functools.lru_cache
    def check_installation_fee_cached(item):
        return check_installation_fee(item)
>>> def fee_check_cached():
        orders = random.choices([i for i in range(100)], k=1000)
        return dict(zip(orders, [check_installation_fee_cached(f'widget-{w}') for w in orders]))
>>> cProfile.run("fee_check_cached()")
         1509 function calls in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.000    0.000    0.003    0.000 <ipython-input-2-743cd89b4a06>:1(check_installation_fee)
      100    0.000    0.000    0.003    0.000 <ipython-input-8-a0ae210d96b8>:3(check_installation_fee_cached)
        1    0.000    0.000    0.004    0.004 <ipython-input-8-a0ae210d96b8>:7(fee_check_cached)
        1    0.000    0.000    0.000    0.000 <ipython-input-8-a0ae210d96b8>:8(<listcomp>)
        1    0.001    0.001    0.004    0.004 <ipython-input-8-a0ae210d96b8>:9(<listcomp>)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 random.py:386(choices)
        1    0.000    0.000    0.000    0.000 random.py:399(<listcomp>)
        1    0.000    0.000    0.004    0.004 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
      100    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      100    0.003    0.000    0.003    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
      100    0.000    0.000    0.000    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     1000    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}

Wow, it looks like it did a great job of running much faster! In this limited example its 6x faster, with only 100 cursor executions (as we’d expect). And we only had to add a simple decorator.

One cool thing about this implementation is that it adds methods to our function for checking how the cache is doing, as well as clearing it. If you have long running code and need to ensure you hit the database for fresh values, you can do that by clearing the cache before using the function.

>>> check_installation_fee_cached.cache_info()
CacheInfo(hits=900, misses=100, maxsize=128, currsize=100)
>>> check_installation_fee_cached.cache_clear()

There’s more caching options in the functools module, including a simpler interface for an unbound cache if you’re using Python 3.9 or above. You can also easily add a cached property to your class if you’re using Python 3.8 or above. Don’t go and write this from scratch unless you are sure these won’t work for you.

In summary, in this post we learned a few simple ways to profile our code using cProfile. Once we saw where our code was spending time, we sped it up using a simple cache provided by the functools module.

There are many other ways to profile Python code, I’ll dig into more of those in future posts, along with ways to speed up the code that the profile reveals is slowing us down.

Don't miss any articles!

If you like this article, give me your email and I'll send you my latest articles along with other helpful links and tips with a focus on Python, pandas, and related tools.

Invalid email address
I promise not to spam you, and you can unsubscribe at any time.

Have anything to say about this topic?