This is the fifth post in a series on indexing and selecting in pandas. If you are jumping in the middle and want to get caught up, here’s what has been discussed so far:

- Basic indexing, selecting by label and location
- Slicing in pandas
- Selecting by boolean indexing
- Selecting by callable

Once the basics were covered in the first three posts, we were able to move on to more detailed topics on how to select and update data. In this post, we’ll look at selecting using `where` and `mask`.

In the third post of this series, we covered the concept of boolean indexing. If you remember, boolean indexing allows us to essentially query our data (either a `DataFrame` or a `Series`) and return only the data that matches the boolean vector we use as our indexer. For example, we can select the odd values in a `Series` like this:

```python
>>> import pandas as pd
>>> import numpy as np
>>>
>>> s = pd.Series(np.arange(10))
>>> s[s % 2 != 0]
1    1
3    3
5    5
7    7
9    9
dtype: int64
```

## Where?

You’ll notice that our result here is only 5 elements even though the original `Series` contains 10 elements. This is the whole point of indexing: selecting only the values you want. But what happens if you want the shape of your result to match your original data? In this case, you use `where`. The values that are selected by the condition are returned; the values that are not selected are set to `NaN`. This way, the result has the same shape as your original data.

```python
>>> s.where(s % 2 != 0)
0    NaN
1    1.0
2    NaN
3    3.0
4    NaN
5    5.0
6    NaN
7    7.0
8    NaN
9    9.0
dtype: float64
```

`where` also accepts an optional argument for what you want the non-matching values to be, if `NaN` is not what you want, and it is quite flexible. For starters, it can be a scalar or a `Series`/`DataFrame`.

```python
>>> s.where(s % 2 != 0, -1)  # set the non-matching elements to one value
0   -1
1    1
2   -1
3    3
4   -1
5    5
6   -1
7    7
8   -1
9    9
dtype: int64

>>> s.where(s % 2 != 0, -s)  # set the non-matching elements to the negative value of the original series
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64
```

Both the `cond` argument and the `other` argument can be callables that accept the `Series` or `DataFrame` and return either a scalar or a `Series`/`DataFrame`. The `cond` callable should return a boolean mask; the `other` callable can return whatever values you want for the non-selected elements. So the above could be expressed (more verbosely) as:

```python
>>> s.where(lambda x: x % 2 != 0, lambda x: x * -1)
```
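
Callables are mostly useful in method chains, where the intermediate result never gets a name you can refer to. Here’s a minimal sketch (the chain itself is just for illustration):

```python
>>> (pd.Series(np.arange(10))
...    .mul(2)                       # double each value: 0, 2, 4, ..., 18
...    .where(lambda x: x > 10, 0))  # keep values above 10, zero out the rest
0     0
1     0
2     0
3     0
4     0
5     0
6    12
7    14
8    16
9    18
dtype: int64
```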

Using `where` will always return a copy of the existing data. But if you want to modify the original, you can use the `inplace` argument, similar to many other pandas methods (like `fillna` or `ffill`).

```python
>>> s.where(s % 2 != 0, -s, inplace=True)
>>> s
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64
```
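
One thing to watch out for with `inplace=True` is that the method then returns `None`, so don’t assign the result back to a variable. A quick check, on a throwaway copy so we don’t disturb `s`:

```python
>>> tmp = s.copy()
>>> result = tmp.where(tmp % 2 != 0, 0, inplace=True)  # modifies tmp, returns None
>>> result is None
True
```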

The `.mask` method is just the inverse of `where`. Instead of selecting values based on the condition, it selects values where the condition is `False`. Everything else is the same as above.

```python
>>> s.mask(s % 2 != 0, 99)
0     0
1    99
2    -2
3    99
4    -4
5    99
6    -6
7    99
8    -8
9    99
dtype: int64
```
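
Since the two methods are mirror images, a `mask` on a condition is the same as a `where` on its negation. A quick way to convince yourself:

```python
>>> s.mask(s % 2 != 0, 99).equals(s.where(~(s % 2 != 0), 99))
True
```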

## Updating data

One thing I noticed while writing this, which I had missed before, is that `where` is the underlying implementation for boolean indexing with the array indexing operator (i.e. `[]`) on a `DataFrame`. So you’ve already been using `where` even if you didn’t know it.

This has implications for updating data. So with a `DataFrame` of random floats around 0,

```python
>>> df = pd.DataFrame(np.random.random_sample((5, 5)) - .5)
>>> df
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306  0.365862  0.141009
1  0.394965  0.283149 -0.014750  0.279396 -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736  0.229474
3 -0.276158 -0.438667 -0.290731  0.317484 -0.378233
4 -0.018927  0.354160 -0.254558 -0.056842 -0.245184

>>> df.where(df < 0)  # these two are equivalent
>>> df[df < 0]
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306       NaN       NaN
1       NaN       NaN -0.014750       NaN -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736       NaN
3 -0.276158 -0.438667 -0.290731       NaN -0.378233
4 -0.018927       NaN -0.254558 -0.056842 -0.245184
```
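
You can verify that the two expressions really do produce the same result:

```python
>>> df.where(df < 0).equals(df[df < 0])
True
```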

You can also do updates. This is not necessarily that practical for most `DataFrame`s I work with, because I rarely have a `DataFrame` where I want to update across all the columns like this. But it might be useful in some instances, so here’s an example. We could force all the values to be positive by inverting only the negative values.

```python
>>> df[df < 0] = -df
>>> df
          0         1         2         3         4
0  0.326058  0.205408  0.394306  0.365862  0.141009
1  0.394965  0.283149  0.014750  0.279396  0.172909
2  0.141023  0.297178  0.247611  0.170736  0.229474
3  0.276158  0.438667  0.290731  0.317484  0.378233
4  0.018927  0.354160  0.254558  0.056842  0.245184
```
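
As an aside, this particular transformation is just a long-winded absolute value, which makes for an easy check of the `mask` version (on a fresh frame, since `df` has already been updated in place):

```python
>>> df2 = pd.DataFrame(np.random.random_sample((5, 5)) - .5)
>>> df2.mask(df2 < 0, -df2).equals(df2.abs())  # invert only the negatives
True
```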

## NumPy `.where`

If you’re a NumPy user, you are probably familiar with `np.where`. To use it, you supply a condition and optional `x` and `y` values for the `True` and `False` results of the condition. This is a bit different from using `where` in pandas, where the object itself provides data for the `True` result, with an optional `False` result (which defaults to `NaN` if not supplied). So here’s how you’d use it to select odd values in our `Series` and set the even values to `99`.

```python
>>> np.where(s % 2 != 0, s, 99)
array([99,  1, 99,  3, 99,  5, 99,  7, 99,  9])
```
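
Note that `np.where` returns a plain NumPy array, so the index of the `Series` is lost. If you want a pandas object back with the original index, you can wrap the result yourself:

```python
>>> pd.Series(np.where(s % 2 != 0, s, 99), index=s.index)
0    99
1     1
2    99
3     3
4    99
5     5
6    99
7     7
8    99
9     9
dtype: int64
```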

Another way to think about this is that the pandas implementation can be used like the NumPy version: just think of the `self` argument (the `Series` or `DataFrame` itself) as the `x` argument in NumPy.

```python
>>> pd.Series.where(cond=s % 2 != 0, self=s, other=99)
0    99
1     1
2    99
3     3
4    99
5     5
6    99
7     7
8    99
9     9
dtype: int64
```

## More examples

Let’s go back to a data set from a previous post. This is salary info for City of Chicago employees, covering both hourly and salaried workers.

One thing I noticed about this data set last time was that there were a lot of `NaN` values because of the different treatment of salaried and hourly employees. As a result, there’s a column for annual salary, and separate columns for typical hours and hourly rates. What if we just want to know what a typical full salary would be for any employee, regardless of their category?

Using `where` is one way we could address this if we wanted a uniform data set. Instead of applying it to the entire `DataFrame` as in the update example above, we’ll use it to create a single column.

```python
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> # it also only returns 1000 rows (or at least it did for me without an API key)
>>> sal = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")
>>> sal = sal.drop('name', axis=1)  # remove personal info
>>> sal['total_pay'] = sal['annual_salary'].where(sal['salary_or_hourly'] == 'Salary',
...                                               sal['typical_hours'] * sal['hourly_rate'] * 52)
```

Another way to do this would be to selectively update only the rows for hourly workers. But to do that, you end up needing to apply a mask multiple times.

```python
>>> sal['total_pay2'] = sal['annual_salary']
>>> mask = sal['salary_or_hourly'] != 'Salary'
>>> sal.loc[mask, 'total_pay2'] = sal.loc[mask, 'typical_hours'] * sal.loc[mask, 'hourly_rate'] * 52
```
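
Since both columns were computed from the same rows, the two approaches should agree; a quick sanity check:

```python
>>> sal['total_pay'].equals(sal['total_pay2'])
True
```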

So using `where` can result in a slightly simpler expression, even if it’s a little long.

## NumPy `where` and `select` for more complicated updates

There are times when you want to create new columns with some sort of complicated condition on a `DataFrame` that might need to be applied across multiple columns. Using NumPy’s `where` can be helpful in these situations. For example, we can create an hourly rate column that calculates an hourly equivalent for the salaried employees, while using the existing hourly rate for everyone else.

```python
>>> sal['hourly_rate_all'] = np.where(sal['salary_or_hourly'] == 'Salary',
...                                   sal['annual_salary'] / (52 * 40),
...                                   sal['hourly_rate'])
```

If you have a much more complex scenario, you can use `np.select`. Think of `np.select` as a `where` with multiple conditions and multiple choices, as opposed to just one condition with two choices. For example, let’s say that the hourly rate for employees in the police and fire departments was slightly different because of their shift schedule, so their calculation was different. We could do the calculation in one pass. Note that I chose to use the hourly rate as the default (using the last parameter), but I could have just as easily made it a third condition.

```python
>>> conditions = [
...     (sal['salary_or_hourly'] == 'Salary') & (sal['department'].isin(['POLICE', 'FIRE'])),
...     (sal['salary_or_hourly'] == 'Salary') & (~sal['department'].isin(['POLICE', 'FIRE'])),
... ]
>>> choices = [
...     sal['annual_salary'] / (26 * 75),
...     sal['annual_salary'] / (52 * 40),
... ]
>>> sal['hourly_rate_all2'] = np.select(conditions, choices, sal['hourly_rate'])
>>> sal.head()[['department', 'hourly_rate_all', 'hourly_rate_all2']]
    department  hourly_rate_all  hourly_rate_all2
0       POLICE        53.578846         57.150769
1       POLICE        45.250962         48.267692
2         DAIS        57.023077         57.023077
3  WATER MGMNT        56.284615         56.284615
4   TRANSPORTN        44.400000         44.400000
```

In summary, pandas’ `where` and `mask` methods are useful ways to select and update data while keeping the same shape as your original data. NumPy’s `where` and `select` can also be very useful for more complicated scenarios.

Stay tuned for the next post in this series, where I’ll look at the `query` method.