Boolean Indexing in Pandas

This is the third post in the series on indexing and selecting data in pandas. If you haven’t read the others yet, see the first post, which covers the basics of selecting by index label or relative numerical position, and the second post, which covers slicing. In this post, I’m going to talk about boolean indexing, which is how I usually select subsets of data when I work with pandas.

What is boolean indexing?

For those familiar with NumPy, this may already be second nature, but for beginners it is not so obvious. Boolean indexing selects from an array by passing a boolean vector into the indexing operator ([]), returning all the values whose corresponding mask element is True.

One thing to note: this vector needs to be the same length as the array dimension being indexed.

Let’s look at an example.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])

Now we can select the first, second and last elements of our array using a list of array indices.

>>> a[[0, 1, 4]]
array([0, 1, 4])

Boolean indexing can do the same, by creating a boolean array of the same size as the entire array, with elements 0, 1 and 4 set to True, all others False.

>>> mask = np.array([True, True, False, False, True])
>>> a[mask]
array([0, 1, 4])
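As a quick sanity check on the length requirement above, here is a minimal sketch of what happens when the mask is the wrong length (the exact error message varies by NumPy version):

```python
import numpy as np

a = np.arange(5)

# A mask with only 3 elements cannot index a 5-element array
try:
    a[np.array([True, False, True])]
except IndexError as e:
    print("IndexError:", e)
```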

Boolean operators

So now we know how to index our array with a single boolean array. But building that array by hand is a pain, so what you will usually end up doing is applying operations to the original array that return a boolean array themselves.

For example, to select all elements less than 3:

>>> a[a < 3]
array([0, 1, 2])

or all even elements:

>>> a[a % 2 == 0]
array([0, 2, 4])

We can combine these expressions to AND them (&) or OR them (|). With these operators, we can select the same elements as in our first example.

>>> a[(a < 2) | (a >= 4)]
array([0, 1, 4])

Another very helpful operator is the inverse, or NOT, operator (~). Remember to watch your parentheses.

>>> a[~((a < 2) | (a >= 4))]
array([2, 3])

On to pandas

In pandas, boolean indexing works much as it does in NumPy, especially for a Series: you pass in a vector the same length as the Series. This vector doesn’t have to have an index, but if you use a Series as the argument, it does have one, so you need to be aware of how that index aligns with the indexed object’s. A common approach is to build the boolean vector from expressions on the original Series, so the indexes always match.

Series

>>> s = pd.Series(np.arange(5), index=list("abcde"))
>>> s
a    0
b    1
c    2
d    3
e    4
dtype: int64
>>> s[[True, True, False, False, True]]            # this vector is just a list of boolean values
a    0
b    1
e    4
dtype: int64
>>> s[np.array([True, True, False, False, True])]  # this vector is a NumPy array of boolean values
a    0
b    1
e    4
dtype: int64

But, since our index is not the default (i.e. not a RangeIndex), using another Series of the same length will not work unless its index matches. The default index created below doesn’t match our character index, so we have to specify a matching index explicitly.

>>> try:
...     s[pd.Series([True, True, False, False, True])]
... except Exception as ie:
...     print(ie)

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
>>> s[pd.Series([True, True, False, False, True], index=list("abcde"))]
a    0
b    1
e    4
dtype: int64

But instead of making a new Series, we’ll just base all of our expressions on our source data Series or DataFrame, then they’ll share an index.

>>> # just like before with NumPy
>>> s[(s < 2) | (s > 3)]
a    0
b    1
e    4
dtype: int64

Make note that you need to surround each expression with parentheses, because otherwise the Python parser applies the operators with the wrong precedence: | and & bind more tightly than the comparison operators. For the example above, without parentheses it would be parsed as s < (2 | s) > 3. You’ll realize you’re forgetting parentheses when you get complaints about the truth value of a Series being ambiguous. See?

>>> s[s < 2 | s > 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

DataFrame

We can also do boolean indexing on DataFrames. A popular way to create the boolean vector is to use one or more of the columns of the DataFrame.

>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
>>> df[df['x'] < 3]
   x  y
0  0  5
1  1  6
2  2  7

You can also supply multiple conditions, just like before with Series. (Remember those parentheses!)

>>> df[(df['x'] < 3) & (df['y'] > 5)]
   x  y
1  1  6
2  2  7

.loc, .iloc, and []

If you remember from previous posts, pandas has three primary ways to index its containers. The indexing operator ([]) is a hybrid that can take either index labels or, in some cases, positional offsets; .loc is meant for index labels, and .iloc is for integer-based indexing. The good news is that all three accept boolean arrays and return subsets of the underlying container.

>>> mask = (df['x'] < 3) & (df['y'] > 5)
>>> mask
0    False
1     True
2     True
3    False
4    False
dtype: bool
>>> df[mask]
   x  y
1  1  6
2  2  7
>>> df.loc[mask]
   x  y
1  1  6
2  2  7

Note that .iloc is a little different from the others; if you pass in this mask, you’ll get an exception.

>>> try:
...     df.iloc[mask]
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available

This is by design: .iloc is only intended to take positional arguments, and our mask is a Series with an index, so it is rejected. You can still use a boolean vector; just pass in the values themselves, without the index.

>>> df.iloc[mask.to_numpy()]
   x  y
1  1  6
2  2  7
>>> # or
>>> df.iloc[mask.values]
   x  y
1  1  6
2  2  7
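In recent pandas versions, .iloc should also accept a plain Python list of booleans, since a bare list carries no index to conflict with. A small self-contained sketch (using the same toy DataFrame as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
mask = (df['x'] < 3) & (df['y'] > 5)

# Converting the mask Series to a plain list strips its index,
# so .iloc treats it as a purely positional boolean vector
print(df.iloc[list(mask)])
```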

Examples!

I think one of the most helpful things when thinking about boolean indexing is to see some examples. You are only limited by what you can express by grouping together any combination of expressions on your data. You do this by carefully grouping your boolean expressions and using parentheses wisely. It can also help to break the problem into pieces as you work on it.

In this series I’ve been grabbing data from the Chicago Data Portal. This time, I thought the list of lobbyists might be interesting. Due to the need for lobbyists to re-register every year, there’s some repeating data. Let’s take a look.

>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")
>>> lbys.dtypes
year               int64
lobbyist_id        int64
salutation        object
first_name        object
last_name         object
address_1         object
city              object
state             object
zip               object
country           object
email             object
phone             object
fax               object
employer_id        int64
employer_name     object
created_date      object
middle_initial    object
address_2         object
suffix            object
dtype: object
>>> lbys['created_date'] = pd.to_datetime(lbys['created_date'])
>>> # I'll drop the personally identifiable data, just to be nice
>>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)
>>>
>>> lbys.head(3)
   year  lobbyist_id salutation first_name            address_1  ...              employer_name created_date middle_initial address_2  suffix
0  2020        18883        MR.   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-08-07            NaN       NaN     NaN
1  2019        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-01-21            NaN       NaN     NaN
2  2018        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2018-12-12            NaN       NaN     NaN

[3 rows x 15 columns]

In terms of examples, there’s not much complexity in this dataset, but here are a few queries to give you an idea of what boolean indexing looks like.

>>> lbys[lbys['year'] == 2020]  # all lobbyists registered in 2020
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[125 rows x 15 columns]
>>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[76 rows x 15 columns]
>>> # let's get the most popular employer for 2020
>>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]
>>> # who works for them?
>>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]
     year  lobbyist_id salutation first_name          address_1  ...                         employer_name created_date middle_initial  address_2  suffix
619  2020        24484        NaN      BRIAN   1330 W FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN    STE 800     NaN
654  2020        24106        NaN     HOWARD  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-17            NaN  SUITE 800     NaN
982  2020        23828        MS.     SHELLY  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN  SUITE 800     NaN

[3 rows x 15 columns]

If we only want to deal with 2020 data, we can just make a new smaller DataFrame with that data.

lbys = lbys[lbys['year'] == 2020]
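One caveat: if you plan to modify that smaller DataFrame afterward, taking an explicit .copy() avoids pandas’ SettingWithCopyWarning about writing to what may be a view. A minimal sketch with made-up stand-in data (the column names just mirror the example above):

```python
import pandas as pd

# Hypothetical stand-in for the lobbyist data
lbys = pd.DataFrame({
    'year': [2019, 2020, 2020],
    'city': ['CHICAGO', 'CHICAGO', 'SPRINGFIELD'],
})

# An explicit copy makes later in-place edits unambiguous
lbys_2020 = lbys[lbys['year'] == 2020].copy()
lbys_2020['local'] = lbys_2020['city'] == 'CHICAGO'  # no warning
print(lbys_2020)
```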

Boolean indexing with isin

A helpful method that is often paired with boolean indexing is Series.isin. It returns a boolean vector that is True wherever an element matches one of the values passed in.

>>> lbys['state'].tail()
995    MO
996    MO
997    IL
998    IL
999    IL
Name: state, dtype: object
>>> lbys['state'].isin(['IL']).tail()
995    False
996    False
997     True
998     True
999     True
Name: state, dtype: bool
>>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states
     year  lobbyist_id salutation first_name  ... created_date middle_initial      address_2 suffix
164  2020        24879        NaN        DAN  ...   2020-06-15            NaN            NaN    NaN
786  2012         4644        MR.      BRIAN  ...   2012-05-25            NaN            NaN    NaN
788  2019        15543        MR.    LORENZO  ...   2019-01-10            NaN            NaN    NaN
789  2018        15543        MR.    LORENZO  ...   2018-01-17            NaN            NaN    NaN
790  2017        15543        MR.    LORENZO  ...   2017-01-19            NaN            NaN    NaN
791  2020        15543        MR.    LORENZO  ...   2020-01-10            NaN            NaN    NaN
792  2016        15543        MR.    LORENZO  ...   2016-10-13            NaN            NaN    NaN
994  2016        10222        MS.    MARILYN  ...   2016-01-13            NaN  MO1-800-14-40    NaN
995  2015        10222        MS.    MARILYN  ...   2015-01-14            NaN  MO1-800-14-40    NaN
996  2014        10222        MS.    MARILYN  ...   2014-05-06            NaN  MO1-800-14-40    NaN

[10 rows x 15 columns]

I’ll wrap it up with a slightly more complicated expression.

>>> lbys[
...     ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states
...     (lbys['state'] != 'IL') &                               # and NOT from IL
...     (lbys['created_date'] >= '2020-07-01')                  # created in the last half of the year
...     ]
     year  lobbyist_id salutation first_name  ... created_date middle_initial       address_2 suffix
26   2021        24967        NaN        JON  ...   2020-12-27            NaN       SUITE 104    NaN
27   2020        24967        NaN        JON  ...   2020-10-21            NaN       SUITE 104    NaN
34   2021        24901        NaN     JOSEPH  ...   2020-12-20            NaN             NaN    NaN
160  2021        23782        MR.       ALAN  ...   2021-01-01              P   SUITE 404-352    NaN
227  2020        24909        MS.   LAKEITHA  ...   2020-07-14            NaN             NaN    NaN
332  2021        22341        NaN   CAROLINE  ...   2020-12-23            NaN             NaN    NaN
511  2021         5361        NaN        JAY  ...   2020-12-27            NaN       SUITE 450    NaN
609  2021         4561        NaN       BRET  ...   2020-12-18            NaN             NaN    NaN
660  2021        13801        MR.       JOHN  ...   2020-12-27            NaN  MC 482-C30-C76    NaN
700  2020        24925        NaN       TAMI  ...   2020-08-06            NaN             NaN    NaN
862  2020        24969        MR.       ALEX  ...   2020-08-31            NaN             NaN    NaN
951  2021         6164        MS.  GABRIELLE  ...   2020-12-21            NaN      73RD FLOOR    NaN

[12 rows x 15 columns]

I also find it helpful to create variables for storing the masks. For the above example, instead of forcing the reader to parse the entire expression, expressive variable names for the parts of the indexing expression can make the intent clearer.

>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
>>> non_illinois = (lbys['state'] != 'IL')
>>>
>>> # more readable maybe?
>>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]

Often when building a complex expression, it can be helpful to build it in pieces, so assigning parts of the mask to variables can make a complicated expression easier to read, at the cost of extra variables to keep track of. In general, I use variables for masks I have to reuse; if a mask is only used once, I write the entire expression in place.
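To make the mask-composition idea concrete without fetching the Chicago dataset, here’s a self-contained sketch with made-up data (the column names mirror the lobbyist example above):

```python
import pandas as pd

# Hypothetical stand-in for the lobbyist data
df = pd.DataFrame({
    'state': ['IL', 'WI', 'NY', 'CA', 'MO'],
    'year':  [2020, 2020, 2020, 2019, 2020],
})

# Name each piece of the mask for readability
bordering = df['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])
illinois = df['state'] == 'IL'
in_2020 = df['year'] == 2020

# Out-of-region registrations from 2020: neither bordering nor IL
print(df[~bordering & ~illinois & in_2020])
```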

In summary, boolean indexing is really quite simple, but powerful. I’ll be looking at a few other ways to select data in pandas in upcoming posts.

Have anything to say about this topic?