Indexing and Selecting in Pandas – slicing

Slicing data in pandas

This is second in the series on indexing and selecting data in pandas. If you haven’t read it yet, see the first post that covers the basics of selecting based on index or relative numerical indexing. In this post, I’m going to review slicing, which is a core Python topic, but has a few subtle issues related to pandas. Then, using slicing, we’ll review the selection methods from the last post, but now with slicing applied.

What is a slice, and what is slicing?

In short, a slice is a portion of a sequence.

A slice is defined in the Python glossory.

An object usually containing a portion of a sequence. A slice is created using the subscript notation, [] with colons between numbers when several are given, such as in variable_name[1:3:5]. The bracket (subscript) notation uses slice objects internally.

Slicing is defined in the Python reference.

a slicing selects a range of items in a sequence object (e.g., a string, tuple or list). Slicings may be used as expressions or as targets in assignment or del statements. 

A slicing has three parts: start, stop, and step. The slice class is a Python builtin, so you can instantiate and use it with no imports. If you instantiante a slice object explicitly, you can supply a single value (the stop), or you can supply the start and stop and optionally, the step. In terms of what is selected in the sequence, the start is inclusive, the stop is not, and the step determines how you count from start to stop. You can use slice objects or slice notation, but unless you need to store the slice object for re-use, you usually see slice notation in code.

So let’s look at few examples to clear this up. We’ll start with a string as a sequence object.

>>> w = "abcdefg"
>>> # all of these select the same slice of w, "abc"
>>> w[slice(0, 3, 1)]  # This is a slice(start=0, stop=3, step=1)
'abc'
>>> w[0:3:1]           # This is the same, but in slice notation
'abc'
>>> w[slice(3)]        # start=0, step=1 are the defaults
'abc'
>>> w[:3]              # same here
'abc'
>>> w[:]               # null slice (:) selects the entire sequence. start=0, stop=last element (inclusive), step=1
'abcdefg'

Now slices can be used very effectively and concisely to get views of data. I’ll just cover a few examples here.

>>> w[::-1] # reverse a string (start=0, stop=end, stepping backwards)
'gfedcba'
>>> w[1:3] # a portion of a string
'bc'

Strings are special, you can’t modify them directly or delete their values. Same with tuples.

>>> try:
...     w[2] = 'z'
... except TypeError as te:
...     print(te)
'str' object does not support item assignment
>>> try:
...     t = (0, 1, 2)
...     t[0] = -1
... except TypeError as te:
...     print(te)
'tuple' object does not support item assignment

We’ll switch to using a list to show some more applications of slicing.

>>> l = [0, 1, 2, 3, 4, 5, 6]
>>> l[slice(0,3,1)]        # just as with strings, these all work on a list, and return [0, 1, 2]
[0, 1, 2]
>>> l[0:3:1]
[0, 1, 2]
>>> l[slice(3)]
[0, 1, 2]
>>> l[:3]
[0, 1, 2]
>>> l[:]                    # null slice (:) returns a full copy
[0, 1, 2, 3, 4, 5, 6]
>>> l[::2]                  # using step
[0, 2, 4, 6]
>>> l[1::2] = [-1, -1, -1]  # can also assign
>>> l
[0, -1, 2, -1, 4, -1, 6]
>>> del l[:]                # this empties the list, but it still exists

On to pandas

OK, after a quick overview of slicing, let’s move onto pandas and see why we’re talking about slicing at all. 

Just like the last post, I’ll grab a dataset from the Chicago Data Portal, but this time we’ll grab the list of Chicago public libraries.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> df = pd.read_json("https://data.cityofchicago.org/resource/x8fc-8rcq.json")
>>> df.dtypes
name_                          object
hours_of_operation             object
address                        object
city                           object
state                          object
zip                             int64
phone                          object
website                        object
location                       object
:@computed_region_rpca_8um6     int64
:@computed_region_vrxf_vc4k     int64
:@computed_region_6mkv_f3dw     int64
:@computed_region_bdys_3d7i     int64
:@computed_region_43wa_7qmu     int64
dtype: object
>>> # trim down the columns
>>> df = df[['name_', 'hours_of_operation', 'address', 'city', 'state', 'zip', 'phone', 'website', 'location']]
>>> df.head(3)
             name_  ...                                           location
0  Vodak-East Side  ...  {'latitude': '41.70302747819179', 'longitude':...
1      Albany Park  ...  {'latitude': '41.975456', 'longitude': '-87.71...
2           Avalon  ...  {'latitude': '41.74671722160199', 'longitude':...

[3 rows x 9 columns]

Slicing in pandas

Now in pandas (and the underlying NumPy data structures), slicing is also supported, and you can use it via the indexing operator ([]), .loc, and .iloc. Just like in the last post, we’ll walk through these three methods and look at slicing for both Series and DataFrame objects. Also, just like in the last post to prevent confusion between location based indexing and label indexing, I’m going to update the index of our sample data to be string based for the initial explanations.

>>> # this will make our examples a bit more obvious
>>> df.index = "ID-" + df.index.astype(str)
>>> # Use the name_ column for a sample Series.
>>> s = df['name_']

Slicing with []

Just like basic indexing, we’ll see that slicing with .loc and .iloc is preferred to using []. Since the behavior of [] can depend on the index and arugments, it’s better to use the more explicit indexers. There’s a just a few things to keep in mind if you use it.

Series

Like standard Python, you can use integer slicing on Series objects, with start:stop:step.

>>> s[:3]            # These all slice the first 3 elements in the Series
>>> s[0:3]
>>> s[0:3:1]
>>> s[slice(0,3,1)]
ID-0    Vodak-East Side
ID-1        Albany Park
ID-2             Avalon
Name: name_, dtype: object

In a Series, we also can slice on our index, which in this case is an index of objects (strings). Note that the slice start and stop are inclusive, which is different than regular Python slicing!

>>> s['ID-4':'ID-9']
ID-4    Popular Library at Water Works
ID-5                      Little Italy
ID-6                         Edgebrook
ID-7                             Kelly
ID-8                         Edgewater
ID-9                      North Austin
Name: name_, dtype: object

You can specify a step as well.

>>> s['ID-10':'ID-15':2]
ID-10              Canaryville
ID-12                  Manning
ID-14    Whitney M. Young, Jr.
Name: name_, dtype: object

Note that your labels have to be in the index or you’ll get a KeyError.

>>> try:
...     s['ID-10':'ID-999']
... except KeyError as ke:
...     print(ke)
'ID-999'

You can also (potentially) assign/update a slice. Note that you’ll get a SettingWithCopyWarning (see below) when doing this. More on this in future posts (or read the pandas docs for details), but for now just know that when updating a subset of your data, you need to ensure you aren’t trying to update a view of it, but the actual underlying DataFrame or Series.

>>> s['ID-79':] = "Changed"
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
>>> s.tail()
ID-76    Woodson Regional Library
ID-77                     Mayfair
ID-78                 Chicago Bee
ID-79                     Changed
ID-80                     Changed
Name: name_, dtype: object

DataFrame

In a DataFrame, slicing with [] will slice on the rows. If you remember back to the last post, selection on a DataFrame with[]and a single argument selects the columns, so this is a bit confusing. Just remember, if you see a :, it’s slicing on rows.

>>> df[1:3]
            name_  ...                                           location
ID-1  Albany Park  ...  {'latitude': '41.975456', 'longitude': '-87.71...
ID-2       Avalon  ...  {'latitude': '41.74671722160199', 'longitude':...

[2 rows x 9 columns]

You can also use the step.

Indexing in pandas can be so confusing

There are so many ways to do the same thing! What is the difference between .loc, .iloc, .ix, and []?  You can read the official documentation but there's so much of it and it seems so confusing. You can ask a question on Stack Overflow, but you're just as likely to get too many different and confusing answers as no answer at all. And existing answers don't fit your scenario.

You just need to get started with the basics.

What if you could quickly learn the basics of indexing and selecting data in pandas with clear examples and instructions on why and when you should use each one? What if the examples were all consistent, used realistic data, and included extra relevant background information?

Master the basics of pandas indexing with my free ebook. You'll learn what you need to get comfortable with pandas indexing. Covered topics include:

  • what an index is and why it is needed
  • how to select data in both a Series and DataFrame.
  • the difference between .loc, .iloc, .ix, and [] and when (and if) you should use them.
  • slicing, and how pandas slicing compares to regular Python slicing
  • boolean indexing
  • selecting via callable
  • how to use where and mask.
  • how to use query, and how it can help performance
  • time series indexing

Because it's highly focused, you'll learn the basics of indexing and be able to fall back on this knowledge time and again as you use other features in pandas.

Just give me your email and you'll get the free 57 page e-book, along with helpful articles about Python, pandas, and related technologies once or twice a month. Unsubscribe at any time.

Invalid email address
>>> df['ID-3':'ID-7':2]
             name_  ...                                           location
ID-3      Brainerd  ...  {'latitude': '41.732316757097045', 'longitude'...
ID-5  Little Italy  ...  {'latitude': '41.867166', 'longitude': '-87.66...
ID-7         Kelly  ...  {'latitude': '41.78217163315653', 'longitude':...

[3 rows x 9 columns]

Slicing with .loc.

Now, just like with basic selection, .loc and .iloc are meant for label and position respectively. With .loc you cannot pass in locations unless those match the index you are using. In our case, we’ll get a TypeError. The issue is our index doesn’t contain these values, and so we’ll be rejected.

Note that with .loc, we will get both the start and stop values if they are in the index, just like with [].

Series

For our example data, you can’t slice with integers (since they aren’t in the index), but we can slice by labels.

>>> try:
...     s.loc[1:3]
... except TypeError as te:
...     print(te)
...
cannot do slice indexing on Index with these indexers [1] of type int
>>> s.loc['ID-1':'ID-3']
ID-1    Albany Park
ID-2         Avalon
ID-3       Brainerd
Name: name_, dtype: object

Assignment or updates (can) work with .loc. We’ll talk about updates in more detail in the future.

>>> s.loc['ID-79':] = "Changed again"
/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
>>> s.tail()
ID-76    Woodson Regional Library
ID-77                     Mayfair
ID-78                 Chicago Bee
ID-79               Changed again
ID-80               Changed again
Name: name_, dtype: object

DataFrame

And things are similar for a DataFrame.

>>> try:
...     df.loc[1:3]
... except TypeError as te:
...     print(te)
...
cannot do slice indexing on Index with these indexers [1] of type int
>>> df.loc['ID-1':'ID-3']
            name_  ...                                           location
ID-1  Albany Park  ...  {'latitude': '41.975456', 'longitude': '-87.71...
ID-2       Avalon  ...  {'latitude': '41.74671722160199', 'longitude':...
ID-3     Brainerd  ...  {'latitude': '41.732316757097045', 'longitude'...

[3 rows x 9 columns]

Now with a DataFrame, you can slice both on the index and the columns, by label.

>>> df.loc['ID-10':'ID-20':2, "name_":"address"]
                       name_                                 hours_of_operation            address
ID-10            Canaryville  Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....    642 W. 43rd St.
ID-12                Manning  Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....    6 S. Hoyne Ave.
ID-14  Whitney M. Young, Jr.  Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...  415 East 79th St.
ID-16             Scottsdale  Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...   4101 W. 79th St.
ID-18              West Lawn  Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....   4020 W. 63rd St.
ID-20            Rogers Park  Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....  6907 N. Clark St.

I’ll also point out a common idiom in pandas code to use a full slice (:) to select the entire object when not using slicing to limit selection.

>>> df['name_']           # this is one way to select a single column, for example
>>> df.loc[:,'name_']     # but this is how you select that column using .loc

Slicing with .iloc

Remember, .iloc is for strictly integer based indexing, so as you’d expect, it doesn’t work with labels. But also note that .iloc works like standard Python slicing, i.e. the stop value is not included.

Series

>>> try:
...     s.iloc["ID-1":"ID-3"]  # not this way with .iloc
... except TypeError as te:
...     print(te)
...
cannot do positional indexing on Index with these indexers [ID-1] of type str
>>> s.iloc[1:3]                # this way
ID-1    Albany Park
ID-2         Avalon
Name: name_, dtype: object

DataFrame

Similarly, DataFrame selects from the index as the first argument, and the columns with the second argument.

>>> df.iloc[1:3]
            name_  ...                                           location
ID-1  Albany Park  ...  {'latitude': '41.975456', 'longitude': '-87.71...
ID-2       Avalon  ...  {'latitude': '41.74671722160199', 'longitude':...

[2 rows x 9 columns]
>>> df.iloc[1:3, 0:2]
            name_                                 hours_of_operation
ID-1  Albany Park  Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...
ID-2       Avalon  Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....
>>> df.iloc[1:10:3, 2:-4]
                   address     city state
ID-1   3401 W. Foster Ave.  Chicago    IL
ID-4    163 E. Pearson St.  Chicago    IL
ID-7  6151 S. Normal Blvd.  Chicago    IL

Some special cases

There’s always some special cases to consider, so I’ll go through those last. 

Changing behavior for integer based indexes

You’ll remember that I changed the index on our first test DataFrame to be a string based index instead of the standard RangeIndex of sequential integer values. This following example shows why, but you should now understand why it works this way.

>>> df2 = pd.DataFrame({"a": [0, 1, 2, 3], "b": [4, 5, 6, 7], "c": [8, 9, 10, 11]})
>>> df2
   a  b   c
0  0  4   8
1  1  5   9
2  2  6  10
3  3  7  11
>>> df2.loc[0:2]  # Remember, .loc is label based
   a  b   c
0  0  4   8
1  1  5   9
2  2  6  10
>>> df2.iloc[0:2]  # And .iloc is position based
   a  b  c
0  0  4  8
1  1  5  9
>>> df2[0:2]   # So here [] behaves like position based, even though we may be trying to use our index
   a  b  c
0  0  4  8
1  1  5  9
>>> df2.index = ["w", "x", "y", "z"]
>>> df2["w":"y"]  # now, [] behaves like it's label based.
   a  b   c
w  0  4   8
x  1  5   9
y  2  6  10

Hopefully this example reinforces for you why you’ll want to use .loc and .iloc in production code, especially when you’re taking values to select rows or columns from unknown sources!

Label slicing and sorted indexes

We saw that using .loc will include both the start and stop values of a slice, which is not the way standard Python works. Well, you can end up with indexes that are not sorted and will behave slightly differently than expected. Essentially, what you need to notice is that with a sorted index, all values in between the start and stop will be included (including the value for the stop label)

>>> s2 = pd.Series([1, 2, 3, 4, 5], index=[5, 2, 1, 9, 7])
>>> s3 = pd.Series([1, 2, 3, 4, 5], index=[0, 1, 2, 3, 4])
>>>
>>> s3.loc[0:3] # works just as you'd expect for .loc (not standard python slicing, includes end label)
0    1
1    2
2    3
3    4
dtype: int64
>>> s3.loc[0:9]  # no exception raised, even though 9 is not in the index, and includes all values between them
0    1
1    2
2    3
3    4
4    5
dtype: int64
>>> try:
...     s2.loc[0:9]   # hmmm, this won't work
... except KeyError as ke:
...     print(ke)
...
0
>>> s2.sort_index().loc[0:9] # this works though.
1    3
2    2
5    1
7    5
9    4
dtype: int64

The reason for this functionality is that it is too expensive to do the sorting first, and it may not return the values you’d expect anyway since sorting may not make sense for your index. Just be aware of this behavior when you are dealing with non-sorted indexes. You’ll realize that it’s important to understand the indexes on your data and ensure they are accurately reflecting your data and what you’re trying to accomplish with indexing and slicing it.

The Ellipsis and NumPy slicing

This is a little known part of Python, but in working through these examples I did come across the Ellipsis keyword in looking at the slicing documentation. Ellipsis is just a singleton (like None) and is intended to be used in extended slicing syntax by user-defined container data types. It is represented by the ellipsis literal, ....

The Ellipsis doesn’t appear to be supported much by pandas (beyond returning a full series when invoked like [...]). It also isn’t going to be more useful than using a full slice, .i.e. : for most cases, but it is more useful in NumPy with higher dimensional arrays.

The way to think of it is as a special value that will insert as many full slices as needed to extend the slice at the point of insertion in an index in all directions. I think this makes more sense with an example.

>>> ...
Ellipsis
>>> m = np.arange(27).reshape((3,3,3))  # 3 x 3 x 3 multi-dimensional array
>>> print("Full array:\n", m)
Full array:
 [[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
>>> print("First element in last dimension:\n", m[:,:,0])
First element in last dimension:
 [[ 0  3  6]
 [ 9 12 15]
 [18 21 24]]
>>> print("With ellipsis:\n", m[...,0])
With ellipsis:
 [[ 0  3  6]
 [ 9 12 15]
 [18 21 24]]
>>> print("Last element in last dimension:\n", m[...,-1])
Last element in last dimension:
 [[ 2  5  8]
 [11 14 17]
 [20 23 26]]

This could come in handy if you end up dealing with highly dimensioned arrays and want to save some typing. Or if you just want to try to stump someone with some obscure Python knowledge.

Wrapping it up

OK, that should be enough for now. Hopefully you’ve learned something about slicing, I know I learned a few things in doing this writeup. 

Next let’s move on to boolean indexing, a very powerful way to select any bit of data in your Series or DataFrame using any logic you can think of.

Have anything to say about this topic?