This is the fourth entry in a series on indexing and selecting in pandas. In summary, this is what we’ve covered:
In all of the discussion so far, we’ve focused on the three main methods of selecting data in the two main pandas data structures,
- The array indexing operator, or
.locselector, for selection using the label on the index
.ilocselector, for selection using the location
We noted in the last entry in the series that all three can take a boolean vector as indexer to select data from the object. It turns out that you can also pass in a callable. If you’re not familiar with a callable, it can be a function, or object with a
___call___ method. When used for pandas selection, the callable needs to take one argument, which will be the pandas object, and return a result that will select data from the dataset. Why would you want to do this? We’ll look at how this can be useful.
In this series I’ve been grabbing data from the Chicago Data Portal. For this post, I’ve grabbed the list of current employees for the city. This includes full time and part time, salaried and hourly data.
>>> import pandas as pd >>> >>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited >>> # it also only returns 1000 rows (or at least it did for me without an API key) >>> df = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json") >>> df.dtypes name object job_titles object department object full_or_part_time object salary_or_hourly object annual_salary float64 typical_hours float64 hourly_rate float64 dtype: object >>> df.describe() annual_salary typical_hours hourly_rate count 785.000000 215.000000 215.000000 mean 87307.076637 35.558140 34.706000 std 20342.094746 8.183932 13.027963 min 20568.000000 20.000000 3.000000 25% 76164.000000 40.000000 22.350000 50% 87006.000000 40.000000 38.350000 75% 97386.000000 40.000000 44.400000 max 180000.000000 40.000000 57.040000 >>> df.shape (1000, 8) >>> df = df.drop('name', axis=1) # no need to include personal info in this post
So we have some data, which is a subset of the total list of employees for the city of Chicago. The full dataset should be about 32,000 rows.
Before we give a few examples, let’s clarify what this callable should do. First, the callable will take one argument, which will be the
Series being indexed. What you need to returns a valid value for indexing. This could be any value that we’ve already discussed in earlier posts.
So, if we are using the array indexing operator, on a
DataFrame you’ll remember that you can pass in a single column, or a list of columns to select.
>>> def select_job_titles(df): ... return "job_titles" ... >>> df[select_job_titles] 0 SERGEANT 1 POLICE OFFICER (ASSIGNED AS DETECTIVE) 2 CHIEF CONTRACT EXPEDITER 3 CIVIL ENGINEER IV 4 CONCRETE LABORER ... 995 AVIATION SECURITY OFFICER 996 FIREFIGHTER-EMT 997 LIBRARIAN IV 998 HUMAN SERVICE SPECIALIST II 999 POLICE OFFICER Name: job_titles, Length: 1000, dtype: object >>> def select_job_titles_typical_hours(df): ... return ["job_titles", "typical_hours"] ... >>> df[select_job_titles_typical_hours].dropna() job_titles typical_hours 4 CONCRETE LABORER 40.0 6 TRAFFIC CONTROL AIDE-HOURLY 20.0 7 ELECTRICAL MECHANIC 40.0 10 FOSTER GRANDPARENT 20.0 21 ELECTRICAL MECHANIC (AUTOMOTIVE) 40.0 .. ... ... 971 CONSTRUCTION LABORER 40.0 974 HOISTING ENGINEER 40.0 977 CONSTRUCTION LABORER 40.0 988 CONSTRUCTION LABORER 40.0 991 SANITATION LABORER 40.0 [215 rows x 2 columns]
We can also return a boolean indexer, since that’s a valid argument.
>>> def select_20_hours_or_less(df): ... return df['typical_hours'] <= 20 ... >>> df[select_20_hours_or_less].head(1) job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate 6 TRAFFIC CONTROL AIDE-HOURLY OEMC P Hourly NaN 20.0 19.86
You can also use callables for both the first (row indexer) and second (column indexer) arguments in a
>>> df.loc[lambda df: df['typical_hours'] <= 20, lambda df: ['job_titles', 'typical_hours']].head() job_titles typical_hours 6 TRAFFIC CONTROL AIDE-HOURLY 20.0 10 FOSTER GRANDPARENT 20.0 91 CROSSING GUARD 20.0 113 SENIOR COMPANION 20.0 125 TITLE V PROGRAM TRAINEE I 20.0
OK, so this all seems kind of unnecessary because you could do this much more directly. Why write a separate function to provide another level of redirection?
I have to admit that before writing this post, I don’t think that I’ve used callable indexing much, if at all. But one use case where it’s helpful is something that I do all the time. Maybe you do as well.
Let’s say we want to find departments with an average hourly pay rate below some threshold. Usually you’ll do a group by followed by a selector on the resulting groupby
>>> temp = df.groupby('job_titles').mean() >>> temp[temp['hourly_rate'] < 20].head() annual_salary typical_hours hourly_rate job_titles ALDERMANIC AIDE 41760.0 25.0 14.000 CROSSING GUARD - PER CBA NaN 20.0 15.195 CUSTODIAL WORKER NaN 40.0 19.200 FOSTER GRANDPARENT NaN 20.0 3.000 HOSPITALITY WORKER NaN 20.0 14.110
But with a callable, you can do this without the temporary
>>> df.groupby('job_titles').mean().loc[lambda df: df['hourly_rate'] < 20].head() annual_salary typical_hours hourly_rate job_titles ALDERMANIC AIDE 41760.0 25.0 14.000 CROSSING GUARD - PER CBA NaN 20.0 15.195 CUSTODIAL WORKER NaN 40.0 19.200 FOSTER GRANDPARENT NaN 20.0 3.000 HOSPITALITY WORKER NaN 20.0 14.110
One thing to note is that there’s nothing special about these callables. They still have to return the correct values for the selector you are choosing to use. So for example, you can do this with
>>> df.loc[lambda df: df['department'] == 'CITY COUNCIL'].head(1) job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate 124 STUDENT INTERN - ALDERMANIC CITY COUNCIL F Hourly NaN 35.0 14.0
But you can’t do this, because
.iloc requires a boolean vector without an index (as I talked about in the post on boolean indexing.
>>> try: ... df.iloc[lambda df: df['department'] == 'CITY COUNCIL'] ... except NotImplementedError as nie: ... print(nie) ... iLocation based boolean indexing on an integer type is not available >>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').values].head(1) job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate 124 STUDENT INTERN - ALDERMANIC CITY COUNCIL F Hourly NaN 35.0 14.0 >>> # or >>> df.iloc[lambda df: (df['department'] == 'CITY COUNCIL').to_numpy()].head(1) job_titles department full_or_part_time salary_or_hourly annual_salary typical_hours hourly_rate 124 STUDENT INTERN - ALDERMANIC CITY COUNCIL F Hourly NaN 35.0 14.0
Also, while I’ve used the
DataFrame for all these examples, this works in
Series as well.
>>> s[lambda s: s < 30000] 175 20568.0 Name: annual_salary, dtype: float64
In summary, indexing with a callable allows some flexibity for condensing some code that would otherwise require temporary variables. The thing to remember about the callable technique is that it will need to return a result that is an acceptable argument in the same place as the callable.
I hope you’ll stay tuned for future updates. I’ll plan to talk about the
.where method of selection next.