Indexing and Selecting in Pandas by Callable

This is the fourth entry in a series on indexing and selecting in pandas. In summary, this is what we’ve covered:

In all of the discussion so far, we’ve focused on the three main methods of selecting data in the two main pandas data structures, Series and DataFrame

  • The array indexing operator, or []
  • The .loc selector, for selection using the label on the index
  • The .iloc selector, for selection using the location

We noted in the last entry in the series that all three can take a boolean vector as indexer to select data from the object. It turns out that you can also pass in a callable. If you’re not familiar with a callable, it can be a function, or object with a ___call___ method. When used for pandas selection, the callable needs to take one argument, which will be the pandas object, and return a result that will select data from the dataset. Why would you want to do this? We’ll look at how this can be useful.

In this series I’ve been grabbing data from the Chicago Data Portal. For this post, I’ve grabbed the list of current employees for the city. This includes full time and part time, salaried and hourly data.

>>> import pandas as pd
>>>
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> # it also only returns 1000 rows (or at least it did for me without an API key)
>>> df = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")
>>> df.dtypes
name                  object
job_titles            object
department            object
full_or_part_time     object
salary_or_hourly      object
annual_salary        float64
typical_hours        float64
hourly_rate          float64
dtype: object
>>> df.describe()
       annual_salary  typical_hours  hourly_rate
count     785.000000     215.000000   215.000000
mean    87307.076637      35.558140    34.706000
std     20342.094746       8.183932    13.027963
min     20568.000000      20.000000     3.000000
25%     76164.000000      40.000000    22.350000
50%     87006.000000      40.000000    38.350000
75%     97386.000000      40.000000    44.400000
max    180000.000000      40.000000    57.040000
>>> df.shape
(1000, 8)
>>> df = df.drop('name', axis=1)   # no need to include personal info in this post

Simple callables

So we have some data, which is a subset of the total list of employees for the city of Chicago. The full dataset should be about 32,000 rows.

Before we give a few examples, let’s clarify what this callable should do. First, the callable will take one argument, which will be the DataFrame or Series being indexed. What you need to returns a valid value for indexing. This could be any value that we’ve already discussed in earlier posts.

So, if we are using the array indexing operator, on a DataFrame you’ll remember that you can pass in a single column, or a list of columns to select.

>>> def select_job_titles(df):
...     return "job_titles"
...
>>> df[select_job_titles]
0                                    SERGEANT
1      POLICE OFFICER (ASSIGNED AS DETECTIVE)
2                    CHIEF CONTRACT EXPEDITER
3                           CIVIL ENGINEER IV
4                            CONCRETE LABORER
                        ...
995                 AVIATION SECURITY OFFICER
996                           FIREFIGHTER-EMT
997                              LIBRARIAN IV
998               HUMAN SERVICE SPECIALIST II
999                            POLICE OFFICER
Name: job_titles, Length: 1000, dtype: object
>>> def select_job_titles_typical_hours(df):
...     return ["job_titles", "typical_hours"]
...
>>> df[select_job_titles_typical_hours].dropna()
                           job_titles  typical_hours
4                    CONCRETE LABORER           40.0
6         TRAFFIC CONTROL AIDE-HOURLY           20.0
7                 ELECTRICAL MECHANIC           40.0
10                 FOSTER GRANDPARENT           20.0
21   ELECTRICAL MECHANIC (AUTOMOTIVE)           40.0
..                                ...            ...
971              CONSTRUCTION LABORER           40.0
974                 HOISTING ENGINEER           40.0
977              CONSTRUCTION LABORER           40.0
988              CONSTRUCTION LABORER           40.0
991                SANITATION LABORER           40.0

[215 rows x 2 columns]

We can also return a boolean indexer, since that’s a valid argument.

>>> def select_20_hours_or_less(df):
...     return df['typical_hours'] <= 20
...
>>> df[select_20_hours_or_less].head(1)
                    job_titles department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
6  TRAFFIC CONTROL AIDE-HOURLY       OEMC                 P           Hourly            NaN           20.0        19.86

You can also use callables for both the first (row indexer) and second (column indexer) arguments in a DataFrame.

>>> df.loc[lambda df: df['typical_hours'] <= 20, lambda df: ['job_titles', 'typical_hours']].head()
                      job_titles  typical_hours
6    TRAFFIC CONTROL AIDE-HOURLY           20.0
10            FOSTER GRANDPARENT           20.0
91                CROSSING GUARD           20.0
113             SENIOR COMPANION           20.0
125    TITLE V PROGRAM TRAINEE I           20.0

But why?

OK, so this all seems kind of unnecessary because you could do this much more directly. Why write a separate function to provide another level of redirection?

Don't miss any articles!

If you like this article, give me your email and I'll send you my latest articles along with other helpful links and tips with a focus on Python, pandas, and related tools.

Invalid email address
I promise not to spam you, and you can unsubscribe at any time.

I have to admit that before writing this post, I don’t think that I’ve used callable indexing much, if at all. But one use case where it’s helpful is something that I do all the time. Maybe you do as well.

Let’s say we want to find departments with an average hourly pay rate below some threshold. Usually you’ll do a group by followed by a selector on the resulting groupby DataFrame.

>>> temp = df.groupby('job_titles').mean()
>>> temp[temp['hourly_rate'] < 20].head()
                          annual_salary  typical_hours  hourly_rate
job_titles
ALDERMANIC AIDE                 41760.0           25.0       14.000
CROSSING GUARD - PER CBA            NaN           20.0       15.195
CUSTODIAL WORKER                    NaN           40.0       19.200
FOSTER GRANDPARENT                  NaN           20.0        3.000
HOSPITALITY WORKER                  NaN           20.0       14.110

But with a callable, you can do this without the temporary DataFrame variable.

>>> df.groupby('job_titles').mean().loc[lambda df: df['hourly_rate'] < 20].head()
                          annual_salary  typical_hours  hourly_rate
job_titles
ALDERMANIC AIDE                 41760.0           25.0       14.000
CROSSING GUARD - PER CBA            NaN           20.0       15.195
CUSTODIAL WORKER                    NaN           40.0       19.200
FOSTER GRANDPARENT                  NaN           20.0        3.000
HOSPITALITY WORKER                  NaN           20.0       14.110

One thing to note is that there’s nothing special about these callables. They still have to return the correct values for the selector you are choosing to use. So for example, you can do this with loc:

>>> df.loc[lambda df: df['department'] == 'CITY COUNCIL'].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0

But you can’t do this, because .iloc requires a boolean vector without an index (as I talked about in the post on boolean indexing.

>>> try:
...     df.iloc[lambda df: df['department'] == 'CITY COUNCIL']
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available
>>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').values].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0
>>> # or
>>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').to_numpy()].head(1)
                      job_titles    department full_or_part_time salary_or_hourly  annual_salary  typical_hours  hourly_rate
124  STUDENT INTERN - ALDERMANIC  CITY COUNCIL                 F           Hourly            NaN           35.0         14.0

Also, while I’ve used the DataFrame for all these examples, this works in Series as well.

>>> s[lambda s: s < 30000]
175    20568.0
Name: annual_salary, dtype: float64

In summary, indexing with a callable allows some flexibity for condensing some code that would otherwise require temporary variables. The thing to remember about the callable technique is that it will need to return a result that is an acceptable argument in the same place as the callable. 

The next article in this series covers the.where method.

2 thoughts on “Indexing and Selecting in Pandas by Callable

  1. Great article. I spend my entire workday in data frames and I’m always looking for better ways to slice, filter, etc.

    Cheers,
    G

Have anything to say about this topic?