A pandas.DataFrame.apply example

I recently saw a question about pandas.DataFrame.apply and realized that when I first started using pandas, I often reached for apply when a vectorized solution would have been the better choice. Let’s say you have an existing function that calculates the present value of an investment from scalar arguments, and you also have a DataFrame of investments, perhaps loaded from a CSV file or a database.

PV = FV / (1 + i) ** n
def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods
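
As a quick sanity check of the formula, here is a minimal sketch that calls the function with scalar arguments. The input values are illustrative, and the docstring and variable names are additions for this example only.

```python
def present_value(fv, i_rate, n_periods):
    """Discount a future value back n_periods at a per-period rate."""
    return fv / (1 + i_rate) ** n_periods

# Illustrative inputs: 1000 received after 12 periods at 5% per period.
pv = present_value(1000, 0.05, 12)
print(round(pv, 2))  # 556.84

# With a zero rate there is no discounting, so PV equals FV.
pv_zero_rate = present_value(100, 0.0, 5)
print(pv_zero_rate)  # 100.0
```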

If someone has given us this function, we might be tempted to just use it on our data. So here’s what a DataFrame might look like with some values.

import pandas as pd

df = pd.DataFrame([(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
                  columns=['fv', 'i_rate', 'n_periods'])

One way to apply a function to a DataFrame is to manually iterate over the rows in the frame and call the function on each one.

for (index, row) in df.iterrows():
    df.loc[index, 'pv'] = present_value(row.fv, row.i_rate, row.n_periods)

Another way to reuse that existing function is to use apply on the DataFrame, using axis=1 to apply it to each row (instead of each column).

df['pv'] = df.apply(lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1)

The problem with this technique is that it isn’t vectorized. It forces the present_value function to be evaluated once for each row in the DataFrame, which is much more expensive than an equivalent vectorized solution. In fact, at the time of writing, apply even evaluated the function twice on the first row so that it could choose an optimized path based on the result (pandas 1.1 and later evaluate the first row only once), so the function being applied should not have side effects.
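
To see the per-row evaluation directly, one option is to wrap the function in a small counter. This is only a sketch: counting_present_value and the calls list are hypothetical helpers introduced for illustration, and the exact count may exceed the number of rows by one on older pandas versions.

```python
import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'])

calls = []  # hypothetical bookkeeping: one entry per Python-level call

def counting_present_value(row):
    # row is a Series for a single DataFrame row; row.name is its index label.
    calls.append(row.name)
    return present_value(row['fv'], row['i_rate'], row['n_periods'])

pv = df.apply(counting_present_value, axis=1)

# At least one Python function call per row, regardless of pandas version.
print(f"rows: {len(df)}, calls: {len(calls)}")
```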

So in this case, we should consider a vectorized solution.

df['pv2'] = df['fv'] / (1 + df['i_rate']) ** df['n_periods']
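
Before trusting the faster version, it is worth checking that it produces the same numbers as the row-wise one. A minimal sketch, rebuilding the sample frame so the block is self-contained and comparing the two columns with numpy.allclose (using NumPy here is a choice made for this example):

```python
import numpy as np
import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

df = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'])

# Row-wise version: one Python call per row.
df['pv'] = df.apply(
    lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1)

# Vectorized version: arithmetic on whole columns at once.
df['pv2'] = df['fv'] / (1 + df['i_rate']) ** df['n_periods']

# Compare with a floating-point tolerance rather than exact equality.
same = bool(np.allclose(df['pv'], df['pv2']))
print(same)
```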

If we time these two versions, we can see that the vectorized version is more than twice as fast.
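
One way to run such a comparison yourself is with the standard library's timeit module. The sketch below is an assumption about how the timing could be set up, not the original benchmark: the 1,000-row frame and the helper names with_apply and vectorized are made up for illustration, and the measured ratio will vary with frame size, machine, and pandas version.

```python
import timeit

import pandas as pd

def present_value(fv, i_rate, n_periods):
    return fv / (1 + i_rate) ** n_periods

base = pd.DataFrame(
    [(1000, 0.05, 12), (1000, 0.07, 12), (1000, 0.09, 12), (500, 0.02, 24)],
    columns=['fv', 'i_rate', 'n_periods'])

# Repeat the four sample rows to get a larger, hypothetical frame of 1,000 rows.
big = pd.concat([base] * 250, ignore_index=True)

def with_apply():
    return big.apply(
        lambda r: present_value(r['fv'], r['i_rate'], r['n_periods']), axis=1)

def vectorized():
    return big['fv'] / (1 + big['i_rate']) ** big['n_periods']

# Total wall-clock seconds for 5 runs of each approach.
apply_seconds = timeit.timeit(with_apply, number=5)
vectorized_seconds = timeit.timeit(vectorized, number=5)
print(f"apply: {apply_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s")
```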
