Selecting in Pandas using where and mask

This is the fifth post in a series on indexing and selecting in pandas. If you are jumping in the middle and want to get caught up, here’s what has been discussed so far:

Once the basics were covered in the first three posts, we were able to move on to more detailed topics on how to select and update data. In this post, we’ll look at selecting using where and mask.

In the third post of this series, we covered the concept of boolean indexing. If you remember, boolean indexing allows us to essentially query our data (either a DataFrame or a Series) and return only the data that matches the boolean vector we use as our indexer. For example, we can select the odd values in a Series like this:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> s = pd.Series(np.arange(10))
>>> s[s % 2 != 0]
1    1
3    3
5    5
7    7
9    9
dtype: int64

Where?

You’ll notice that our result here is only 5 elements even though the original Series contains 10. This is the whole point of indexing: selecting only the values you want. But what happens if you want the shape of your result to match your original data? In this case, you use where. The values selected by the where condition are returned; the values that are not selected are set to NaN. This way, the result has the same shape as your original data.

>>> s.where(s % 2 != 0)
0    NaN
1    1.0
2    NaN
3    3.0
4    NaN
5    5.0
6    NaN
7    7.0
8    NaN
9    9.0
dtype: float64

Note that introducing NaN changed the Series dtype from int64 to float64. where also accepts an optional other argument that controls what the non-selected values become, if NaN is not what you want, and it offers some flexibility. For starters, other can be a scalar or a Series/DataFrame.

>>> s.where(s % 2 != 0, -1)  # set the non-matching elements to one value
0   -1
1    1
2   -1
3    3
4   -1
5    5
6   -1
7    7
8   -1
9    9
dtype: int64
>>> s.where(s % 2 != 0, -s)   # set the non-matching elements to the negative value of the original series
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64

Both the condition argument and other can be callables that accept the Series or DataFrame and return either a scalar or a Series/DataFrame. The condition callable should return booleans; the other callable can return whatever values you want for the non-selected elements. So the above could be expressed (more verbosely) as

s.where(lambda x: x % 2 != 0, lambda x: x * -1)
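The callable form is especially handy in the middle of a method chain, where the intermediate result has no name you could use in a boolean expression. A small sketch (the doubling step here is just an invented intermediate, not from the examples above):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# x inside the lambda is the *doubled* Series produced by mul,
# not the original s -- that's what the callable form buys you
result = s.mul(2).where(lambda x: x > 5, 0)
```

Here result keeps the doubled values greater than 5 and zeroes out the rest, without ever naming the intermediate Series.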

Using where will always return a copy of the existing data. But if you want to modify the original, you can with the inplace argument, similar to many other functions in pandas (like fillna or ffill). Note that with inplace=True, nothing is returned.

>>> s.where(s % 2 != 0, -s, inplace=True)
>>> s
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64

The .mask method is just the inverse of where. Instead of keeping the values where the condition is True, it keeps the values where the condition is False. Everything else works the same as above.

>>> s.mask(s % 2 != 0, 99)
0     0
1    99
2    -2
3    99
4    -4
5    99
6    -6
7    99
8    -8
9    99
dtype: int64
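Since mask replaces values where the condition is True and where replaces them where it is False, mask(cond) is equivalent to where(~cond). A quick check with a fresh Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))
cond = s % 2 != 0

# mask fills the True positions; where fills the False positions,
# so negating the condition makes the two agree
masked = s.mask(cond, 99)
inverted = s.where(~cond, 99)
```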

Updating data

One thing I noticed in writing this, which I had missed before, is that where is the underlying implementation for boolean indexing with the indexing operator [] on a DataFrame. So you’ve already been using where even if you didn’t know it.

This has implications for updating data. So with a DataFrame of random floats around 0,

>>> df = pd.DataFrame(np.random.random_sample((5, 5)) - .5)
>>> df
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306  0.365862  0.141009
1  0.394965  0.283149 -0.014750  0.279396 -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736  0.229474
3 -0.276158 -0.438667 -0.290731  0.317484 -0.378233
4 -0.018927  0.354160 -0.254558 -0.056842 -0.245184
>>> df.where(df < 0) # these two are equivalent
>>> df[df < 0]
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306       NaN       NaN
1       NaN       NaN -0.014750       NaN -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736       NaN
3 -0.276158 -0.438667 -0.290731       NaN -0.378233
4 -0.018927       NaN -0.254558 -0.056842 -0.245184

You can also do updates. This is not necessarily that practical for most DataFrames I work with, though, because I rarely have a DataFrame where I want to update across all the columns like this. But in some instances it might be useful, so here’s an example. We could force all the values to be positive by negating only the negative values.

>>> df[df < 0] = -df
>>> df
          0         1         2         3         4
0  0.326058  0.205408  0.394306  0.365862  0.141009
1  0.394965  0.283149  0.014750  0.279396  0.172909
2  0.141023  0.297178  0.247611  0.170736  0.229474
3  0.276158  0.438667  0.290731  0.317484  0.378233
4  0.018927  0.354160  0.254558  0.056842  0.245184
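The same sign flip can also be done without mutating the original, by passing the negated frame as other. With a seeded generator (an assumption for reproducibility, not from the original example), the result matches df.abs():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the frame is reproducible
df = pd.DataFrame(rng.random((5, 5)) - 0.5)

# keep non-negative values, replace negative ones with their negation
flipped = df.where(df >= 0, -df)
```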

NumPy .where

If you’re a NumPy user, you are probably familiar with np.where. To use it, you supply a condition and optional x and y values for the True and False results of the condition. This is a bit different from where in pandas, where the object itself provides the data for the True result, with an optional other for the False result (which defaults to NaN if not supplied). So here’s how you’d use np.where to select the odd values in our Series and set the even values to 99.

>>> np.where(s % 2 != 0, s, 99)
array([99,  1, 99,  3, 99,  5, 99,  7, 99,  9])
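One difference worth noting: np.where returns a plain NumPy array, as seen above, so the pandas index is dropped. If you need the index, rewrap the result:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

out = np.where(s % 2 != 0, s, 99)     # plain ndarray, index is gone
kept = pd.Series(out, index=s.index)  # rewrap to restore the index
```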

Another way to think about this is that the pandas implementation can be used like the NumPy version: just think of the self argument (the Series or DataFrame itself) as the x argument in NumPy.

>>> pd.Series.where(cond=s % 2 != 0, self=s, other=99)
0    99
1     1
2    99
3     3
4    99
5     5
6    99
7     7
8    99
9     9
dtype: int64

More examples

Let’s go back to a data set from a previous post. This is salary info for City of Chicago employees for both hourly and salaried employees.

One thing I noticed about this data set last time was that there were a lot of NaN values because of the different treatment of salaried and hourly employees. As a result, there’s a column for annual salary, and separate columns for typical hours and hourly rates. What if we just want to know what a typical full salary would be for any employee, regardless of their category?

Using where is one way we could address this if we wanted a uniform data set. We won’t apply it to the entire DataFrame as in the update example above; instead, we’ll use it to create a single column.

>>> # you should be able to grab this dataset as an unauthenticated user, but you may be rate limited
>>> # it also only returns 1000 rows (or at least it did for me without an API key)
>>> sal = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")
>>> sal = sal.drop('name', axis=1) # remove personal info
>>> sal['total_pay'] = sal['annual_salary'].where(sal['salary_or_hourly'] == 'Salary',
                                                  sal['typical_hours'] * sal['hourly_rate'] * 52)

Another way to do this would be to selectively update only the rows for hourly workers. But to do that, you end up needing to apply the mask multiple times.

>>> sal['total_pay2'] = sal['annual_salary']
>>> mask = sal['salary_or_hourly'] != 'Salary'
>>> sal.loc[mask, 'total_pay2'] = sal.loc[mask, 'typical_hours'] * sal.loc[mask, 'hourly_rate'] * 52

So using where can result in a slightly simpler expression, even if it’s a little long.
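To see that the two approaches agree, here is a small invented frame shaped like the salary data (the column names mirror the Chicago dataset; the numbers are made up):

```python
import numpy as np
import pandas as pd

sal = pd.DataFrame({
    'salary_or_hourly': ['Salary', 'Hourly', 'Salary', 'Hourly'],
    'annual_salary':    [90000.0, np.nan, 75000.0, np.nan],
    'typical_hours':    [np.nan, 40.0, np.nan, 20.0],
    'hourly_rate':      [np.nan, 30.0, np.nan, 25.0],
})

# approach 1: a single where
sal['total_pay'] = sal['annual_salary'].where(
    sal['salary_or_hourly'] == 'Salary',
    sal['typical_hours'] * sal['hourly_rate'] * 52)

# approach 2: copy the column, then patch the hourly rows via .loc
sal['total_pay2'] = sal['annual_salary']
mask = sal['salary_or_hourly'] != 'Salary'
sal.loc[mask, 'total_pay2'] = (sal.loc[mask, 'typical_hours']
                               * sal.loc[mask, 'hourly_rate'] * 52)
```

Both columns come out identical, with the hourly rows filled in from hours × rate × 52.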

NumPy where and select for more complicated updates

There are times when you want to create new columns with some sort of complicated condition that needs to be applied across multiple columns of a DataFrame. NumPy’s where can be helpful in these situations. For example, we can create an hourly rate column that calculates an hourly equivalent for the salaried employees, but uses the existing hourly rate for everyone else.

>>> sal['hourly_rate_all'] = np.where(sal['salary_or_hourly'] == 'Salary',
                                      sal['annual_salary'] / (52 * 40),
                                      sal['hourly_rate'])

If you have a much more complex scenario, you can use np.select. Think of np.select as a where with multiple conditions and multiple choices, as opposed to just one condition with two choices. For example, let’s say that the hourly rate for employees in the police and fire departments was slightly different because of their shift schedule, so their calculation was different. We could do the calculation in one pass. Note that I chose to use the hourly rate as the default (using the last parameter), but could have just as easily made it a third condition.

>>> conditions = [
...     (sal['salary_or_hourly'] == 'Salary') & (sal['department'].isin(['POLICE', 'FIRE'])),
...     (sal['salary_or_hourly'] == 'Salary') & (~sal['department'].isin(['POLICE', 'FIRE'])),
... ]
>>> choices = [
...     sal['annual_salary'] / (26 * 75),
...     sal['annual_salary'] / (52 * 40)
... ]
>>> sal['hourly_rate_all2'] = np.select(conditions, choices, sal['hourly_rate'])
>>> sal.head()[['department', 'hourly_rate_all', 'hourly_rate_all2']]
    department  hourly_rate_all  hourly_rate_all2
0       POLICE        53.578846         57.150769
1       POLICE        45.250962         48.267692
2         DAIS        57.023077         57.023077
3  WATER MGMNT        56.284615         56.284615
4   TRANSPORTN        44.400000         44.400000

In summary, using pandas’ where and mask methods can be useful ways to select and update data in the same shape as your original data. Using NumPy’s where and select can also be very useful for more complicated scenarios.

Stay tuned for the next post in this series where I’ll look at the query method.
