This is the fifth post in a series on indexing and selecting in pandas. If you are jumping in the middle and want to get caught up, here’s what has been discussed so far:
- Basic indexing, selecting by label and location
- Slicing in pandas
- Selecting by boolean indexing
- Selecting by callable
Once the basics were covered in the first three posts we were able to move onto more detailed topics on how to select and update data. In this post, we’ll look at selecting using
In the third post of this series, we covered the concept of boolean indexing. If you remember, boolean indexing allows us to essentially query our data (either a
DataFrame or a
Series) and return only the data that matches the boolean vector we use as our indexer. For example, to select odd values in a
Series like this:
>>> import pandas as pd >>> import numpy as np >>> >>> s = pd.Series(np.arange(10)) >>> s[s % 2 != 0] 1 1 3 3 5 5 7 7 9 9 dtype: int64
You’ll notice that our result here is only 5 elements even though the original
Series contains 10 elements. This is the whole point of indexing, selecting the values you want. But what happens if you want the shape of your result to match your original data? In this case, you use
where. The values that are selected by the where condition are returned, the values that are not selected are set to
NaN. This way, the value you select has the same shape as your original data.
>>> s.where(s % 2 != 0) 0 NaN 1 1.0 2 NaN 3 3.0 4 NaN 5 5.0 6 NaN 7 7.0 8 NaN 9 9.0 dtype: float64
where also accepts an optional argument for what you want the other values to be, if
NaN is not what you want, with some flexibility. For starters, it can be a scalar or
>>> s.where(s % 2 != 0, -1) # set the non-matching elements to one value 0 -1 1 1 2 -1 3 3 4 -1 5 5 6 -1 7 7 8 -1 9 9 dtype: int64 >>> s.where(s % 2 != 0, -s) # set the non-matching elements to the negative value of the original series 0 0 1 1 2 -2 3 3 4 -4 5 5 6 -6 7 7 8 -8 9 9 dtype: int64
Both the first
condition argument and the
other can be a callable that accepts the
DataFrame and returns either a scalar or a
DataFrame. The condition callable should return boolean, the other can return whatever value you want for non selected values. So the above could be expressed (more verbosely) as
s.where(lambda x: x % 2 != 0, lambda x: x * -1)
where will always return a copy of the existing data. But if you want to modify the original, you can by using the
inplace argument, similar to many other functions in pandas (like
ffill and others).
>>> s.where(s % 2 != 0, -s, inplace=True) >>> s >>> s 0 0 1 1 2 -2 3 3 4 -4 5 5 6 -6 7 7 8 -8 9 9 dtype: int64
.mask method is just the inverse of
where. Instead of selecting values based on the condition, it selects values where the condition is
False. Everthing else is the same as above.
>>> s.mask(s % 2 != 0, 99) 0 0 1 99 2 -2 3 99 4 -4 5 99 6 -6 7 99 8 -8 9 99 dtype: int64
One thing that I noticed in writing this that I had missed before is that
where is the underlying implementation for boolean indexing with the array indexing operator, i.e.
 on a
DataFrame. So you’ve already been using
where even if you didn’t know it.
This has implications for updating data. So with a
DataFrame of random floats around 0,
>>> df = pd.DataFrame(np.random.random_sample((5, 5)) - .5) >>> df 0 1 2 3 4 0 -0.326058 -0.205408 -0.394306 0.365862 0.141009 1 0.394965 0.283149 -0.014750 0.279396 -0.172909 2 -0.141023 -0.297178 -0.247611 -0.170736 0.229474 3 -0.276158 -0.438667 -0.290731 0.317484 -0.378233 4 -0.018927 0.354160 -0.254558 -0.056842 -0.245184 >>> df.where(df < 0) # these two are equivalent >>> df[df < 0] 0 1 2 3 4 0 -0.326058 -0.205408 -0.394306 NaN NaN 1 NaN NaN -0.014750 NaN -0.172909 2 -0.141023 -0.297178 -0.247611 -0.170736 NaN 3 -0.276158 -0.438667 -0.290731 NaN -0.378233 4 -0.018927 NaN -0.254558 -0.056842 -0.245184
You can also do updates. This is not necessarily that practical for most
DataFrames I work with though, because you I rarely have a
DataFrame where I want to update across all the columns like this. But for some instances that might be useful, so here’s an example. We could force all the values to be positive by inverting only the negative values.
>>> df[df < 0] = -df >>> df 0 1 2 3 4 0 0.326058 0.205408 0.394306 0.365862 0.141009 1 0.394965 0.283149 0.014750 0.279396 0.172909 2 0.141023 0.297178 0.247611 0.170736 0.229474 3 0.276158 0.438667 0.290731 0.317484 0.378233 4 0.018927 0.354160 0.254558 0.056842 0.245184
If you’re a NumPy user, you are probably familiar with
np.where. To use it, you supply a condition and optional
y values for
False results in the condition. This is a bit different than using
where in pandas, where the object itself provides data for the
True result, with an optional
False result (which defaults to
NaN if not supplied). So here’s how you’d use it to select odd values in our
Series, and set the even values to
>>> np.where(s % 2 != 0, s, 99) array([99, 1, 99, 3, 99, 5, 99, 7, 99, 9])
Another way to think about this is that the pandas implementation can be used like the NumPy version, just think of the
self argument of the
DataFrame as the
x argument in NumPy.
>>> pd.Series.where(cond=s % 2 != 0, self=s, other=99) 0 99 1 1 2 99 3 3 4 99 5 5 6 99 7 7 8 99 9 9 dtype: int64
Let’s go back to a data set from a previous post. This is salary info for City of Chicago employees for both hourly and salaried employees.
One thing I noticed about this data set last time was that there were a lot of
NaN values because of the different treatment of salaried and hourly employees. As a result, there’s a column for annual salary, and separate columns for typical hours and hourly rates. What if we just want to know what a typical full salary would be for any employee, regardless of their category?
where is one way we could address this if we wanted a uniform data set. Now we won’t apply it to the entire
DataFrame as the update example above, we’ll use it to create one column.
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited >>> # it also only returns 1000 rows (or at least it did for me without an API key) >>> sal = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json") sal = sal.drop('name', axis=1) # remove personal info >>> sal = sal.drop('name', axis=1) # remove personal info >>> sal['total_pay'] = sal['annual_salary'].where(sal['salary_or_hourly'] == 'Salary', sal['typical_hours'] * sal['hourly_rate'] * 52)
Another way to do this, would be to selectively update only rows for hourly workers. But to do this, you end up needing to apply a mask multiple times.
>>> sal['total_pay2'] = sal['annual_salary'] >>> mask = sal['salary_or_hourly'] != 'Salary' >>> sal.loc[mask, 'total_pay2'] = sal.loc[mask, 'typical_hours'] * sal.loc[mask, 'hourly_rate'] * 52
where can result in a slightly more simple expression, even if it’s a little long.
select for more complicated updates
There are times where you want to create new columns with some sort of complicated condition on a dataframe that might need to be applied across multiple columns. Using NumPy
where can be helpful for these situations. For example, we can creating an hourly rate column that calculates an hourly equivalent for the salried employees, but use the existing hourly rate.
>>> sal['hourly_rate_all'] = np.where(sal['salary_or_hourly'] == 'Salary', sal['annual_salary'] / (52 * 40), sal['hourly_rate'])
If you have a much more complex scenario, you can use
np.select. Think of
np.select as a where with multiple conditions and multiple choices, as opposed to just one condition with two choices. For example, let’s say that the hourly rate for employees in the police and fire departments was slightly different because of their shift schedule, so their calculation was different. We could do the calculation in one pass. Note that I chose to use the hourly rate as the default (using the last parameter), but could have just as easily made it a third condition.
>>> conditions = [ ... (sal['salary_or_hourly'] == 'Salary') & (sal['department'].isin(['POLICE', 'FIRE'])), ... (sal['salary_or_hourly'] == 'Salary') & (~sal['department'].isin(['POLICE', 'FIRE'])), ... ] >>> choices = [ ... sal['annual_salary'] / (26 * 75), ... sal['annual_salary'] / (52 * 40) ... ] >>> sal['hourly_rate_all2'] = np.select(conditions, choices, sal['hourly_rate']) >>> sal.head()[['department', 'hourly_rate_all', 'hourly_rate_all2']] department hourly_rate_all hourly_rate_all2 0 POLICE 53.578846 57.150769 1 POLICE 45.250962 48.267692 2 DAIS 57.023077 57.023077 3 WATER MGMNT 56.284615 56.284615 4 TRANSPORTN 44.400000 44.400000
In summary, using pandas’
mask methods can be useful ways to select and update data in the same shape as your original data. Using NumPy’s
select can also be very useful for more complicated scenarios.
Stay tuned for the next post in this series where I’ll look at the