Indexing and Selecting in Pandas by Callable

This is the fourth entry in a series on indexing and selecting in pandas. In summary, this is what we’ve covered:

In all of the discussion so far, we’ve focused on the three main methods of selecting data in the two main pandas data structures, Series and DataFrame

  • The array indexing operator, or []
  • The .loc selector, for selection using the label on the index
  • The .iloc selector, for selection using the location

We noted in the last entry in the series that all three can take a boolean vector as indexer to select data from the object. It turns out that you can also pass in a callable. If you’re not familiar with a callable, it can be a function, or object with a ___call___ method. When used for pandas selection, the callable needs to take one argument, which will be the pandas object, and return a result that will select data from the dataset. Why would you want to do this? We’ll look at how this can be useful.

In this series I’ve been grabbing data from the Chicago Data Portal. For this post, I’ve grabbed the list of current employees for the city. This includes full time and part time, salaried and hourly data.

>>> import pandas as pd
>>>
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> # it also only returns 1000 rows (or at least it did for me without an API key)
>>> df = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")
>>> df.dtypes
name                  object
job_titles            object
department            object
full_or_part_time     object
salary_or_hourly      object
annual_salary        float64
typical_hours        float64
hourly_rate          float64
dtype: object
>>> df.describe()
       annual_salary  typical_hours  hourly_rate
count     785.000000     215.000000   215.000000
mean    87307.076637      35.558140    34.706000
std     20342.094746       8.183932    13.027963
min     20568.000000      20.000000     3.000000
25%     76164.000000      40.000000    22.350000
50%     87006.000000      40.000000    38.350000
75%     97386.000000      40.000000    44.400000
max    180000.000000      40.000000    57.040000
>>> df.shape
(1000, 8)
>>> df = df.drop('name', axis=1)   # no need to include personal info in this post

Simple callables

So we have some data, which is a subset of the total list of employees for the city of Chicago. The full dataset should be about 32,000 rows.

Before we give a few examples, let’s clarify what this callable should do. First, the callable will take one argument, which will be the DataFrame or Series being indexed. What you need to returns a valid value for indexing. This could be any value that we’ve already discussed in earlier posts.

So, if we are using the array indexing operator, on a DataFrame you’ll remember that you can pass in a single column, or a list of columns to select.

>>> def select_job_titles(df):
...     return "job_titles"
...
>>> df[select_job_titles]
0                                    SERGEANT
1      POLICE OFFICER (ASSIGNED AS DETECTIVE)
2                    CHIEF CONTRACT EXPEDITER
3                           CIVIL ENGINEER IV
4                            CONCRETE LABORER
                        ...
995                 AVIATION SECURITY OFFICER
996                           FIREFIGHTER-EMT
997                              LIBRARIAN IV
998               HUMAN SERVICE SPECIALIST II
999                            POLICE OFFICER
Name: job_titles, Length: 1000, dtype: object
>>> def select_job_titles_typical_hours(df):
...     return ["job_titles", "typical_hours"]
...
>>> df[select_job_titles_typical_hours].dropna()
                           job_titles  typical_hours
4                    CONCRETE LABORER           40.0
6         TRAFFIC CONTROL AIDE-HOURLY           20.0
7                 ELECTRICAL MECHANIC           40.0
10                 FOSTER GRANDPARENT           20.0
21   ELECTRICAL MECHANIC (AUTOMOTIVE)           40.0
..                                ...            ...
971              CONSTRUCTION LABORER           40.0
974                 HOISTING ENGINEER           40.0
977              CONSTRUCTION LABORER           40.0
988              CONSTRUCTION LABORER           40.0
991                SANITATION LABORER           40.0

[215 rows x 2 columns]

We can also return a boolean indexer, since that’s a valid argument.

>>> def select_20_hours_or_less(df):
...     return df['typical_hours'] <= 20
...
>>> df[select_20_hours_or_less].head(1)
                    job_titles department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
6  TRAFFIC CONTROL AIDE-HOURLY       OEMC                 P           Hourly            NaN           20.0        19.86

You can also use callables for both the first (row indexer) and second (column indexer) arguments in a DataFrame.

>>> df.loc[lambda df: df['typical_hours'] <= 20, lambda df: ['job_titles', 'typical_hours']].head()
                      job_titles  typical_hours
6    TRAFFIC CONTROL AIDE-HOURLY           20.0
10            FOSTER GRANDPARENT           20.0
91                CROSSING GUARD           20.0
113             SENIOR COMPANION           20.0
125    TITLE V PROGRAM TRAINEE I           20.0

But why?

OK, so this all seems kind of unnecessary because you could do this much more directly. Why write a separate function to provide another level of redirection?

Indexing in pandas can be so confusing

There are so many ways to do the same thing! What is the difference between .loc, .iloc, .ix, and []?  You can read the official documentation but there's so much of it and it seems so confusing. You can ask a question on Stack Overflow, but you're just as likely to get too many different and confusing answers as no answer at all. And existing answers don't fit your scenario.

You just need to get started with the basics.

What if you could quickly learn the basics of indexing and selecting data in pandas with clear examples and instructions on why and when you should use each one? What if the examples were all consistent, used realistic data, and included extra relevant background information?

Master the basics of pandas indexing with my free ebook. You'll learn what you need to get comfortable with pandas indexing. Covered topics include:

  • what an index is and why it is needed
  • how to select data in both a Series and DataFrame.
  • the difference between .loc, .iloc, .ix, and [] and when (and if) you should use them.
  • slicing, and how pandas slicing compares to regular Python slicing
  • boolean indexing
  • selecting via callable
  • how to use where and mask.
  • how to use query, and how it can help performance
  • time series indexing

Because it's highly focused, you'll learn the basics of indexing and be able to fall back on this knowledge time and again as you use other features in pandas.

Just give me your email and you'll get the free 57 page e-book, along with helpful articles about Python, pandas, and related technologies once or twice a month. Unsubscribe at any time.

Invalid email address

I have to admit that before writing this post, I don’t think that I’ve used callable indexing much, if at all. But one use case where it’s helpful is something that I do all the time. Maybe you do as well.

Let’s say we want to find departments with an average hourly pay rate below some threshold. Usually you’ll do a group by followed by a selector on the resulting groupby DataFrame.

>>> temp = df.groupby('job_titles').mean()
>>> temp[temp['hourly_rate'] < 20].head()
                          annual_salary  typical_hours  hourly_rate
job_titles
ALDERMANIC AIDE                 41760.0           25.0       14.000
CROSSING GUARD - PER CBA            NaN           20.0       15.195
CUSTODIAL WORKER                    NaN           40.0       19.200
FOSTER GRANDPARENT                  NaN           20.0        3.000
HOSPITALITY WORKER                  NaN           20.0       14.110

But with a callable, you can do this without the temporary DataFrame variable.

>>> df.groupby('job_titles').mean().loc[lambda df: df['hourly_rate'] < 20].head()
                          annual_salary  typical_hours  hourly_rate
job_titles
ALDERMANIC AIDE                 41760.0           25.0       14.000
CROSSING GUARD - PER CBA            NaN           20.0       15.195
CUSTODIAL WORKER                    NaN           40.0       19.200
FOSTER GRANDPARENT                  NaN           20.0        3.000
HOSPITALITY WORKER                  NaN           20.0       14.110

One thing to note is that there’s nothing special about these callables. They still have to return the correct values for the selector you are choosing to use. So for example, you can do this with loc:

>>> df.loc[lambda df: df['department'] == 'CITY COUNCIL'].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0

But you can’t do this, because .iloc requires a boolean vector without an index (as I talked about in the post on boolean indexing.

>>> try:
...     df.iloc[lambda df: df['department'] == 'CITY COUNCIL']
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available
>>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').values].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0
>>> # or
>>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').to_numpy()].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0

Also, while I’ve used the DataFrame for all these examples, this works in Series as well.

>>> s[lambda s: s < 30000]
175    20568.0
Name: annual_salary, dtype: float64

In summary, indexing with a callable allows some flexibity for condensing some code that would otherwise require temporary variables. The thing to remember about the callable technique is that it will need to return a result that is an acceptable argument in the same place as the callable. 

The next article in this series covers the.where method.

2 thoughts on “Indexing and Selecting in Pandas by Callable

  1. Great article. I spend my entire workday in data frames and I’m always looking for better ways to slice, filter, etc.

    Cheers,
    G

Have anything to say about this topic?