Basic Pandas: Renaming a DataFrame column

A very common need in working with pandas DataFrames is to rename a column. Maybe the columns were supplied by a data source like a CSV file and they need cleanup. Or maybe you just changed your mind during an interactive session. Let’s look at how you can do this, because there’s more than one way.

Let’s say we have a pandas DataFrame with several columns.

[ins] In [1]: import pandas as pd
         ...: import numpy as np
         ...:
         ...: df = pd.DataFrame(np.random.rand(5,5), columns=['A', 'B', 'C', 'D', 'E'])
         ...:
         ...: df
Out[1]:
          A         B         C         D         E
0  0.811204  0.022184  0.179873  0.705248  0.098429
1  0.905231  0.447630  0.970045  0.744982  0.566889
2  0.805913  0.569044  0.760091  0.833827  0.148091
3  0.285781  0.262952  0.250169  0.496548  0.604798
4  0.420414  0.463825  0.025779  0.287122  0.880970

What if we want to rename the columns? There is more than one way to do this, and I’ll start with an indirect answer that’s not really a rename. Sometimes the desire to rename a column comes along with a change to its data, so you may just end up adding a new column instead. Depending on how much memory you can spare and how many columns you’re dealing with, adding another column is a good way to work during ad-hoc exploration, because you keep the intermediate data and can always step back and repeat a step. You can complete the rename by dropping the old column. This isn’t very efficient, but for ad-hoc data exploration it’s quite common.

df['e'] = np.maximum(df['E'], .5)
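
The add-then-drop approach can be sketched end to end as follows. This is a minimal illustration on a small throwaway DataFrame (`demo` is a hypothetical name with made-up values, not the `df` used in the rest of this post, which keeps both columns).

```python
import numpy as np
import pandas as pd

# Hypothetical throwaway frame; the values are made up for illustration.
demo = pd.DataFrame({'D': [0.1, 0.9], 'E': [0.2, 0.8]})

# Add the new column derived from the old one.
demo['e'] = np.maximum(demo['E'], .5)

# Drop the old column to finish the "rename"; drop returns a new DataFrame,
# so the result is assigned back to the same name.
demo = demo.drop(columns='E')

print(list(demo.columns))
```

Until the drop, both the old and the new column are available, which is what makes it easy to step back during exploration.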

But let’s say you really do want to just rename the column in place. Here’s an easy way, but it requires you to update all the columns at once.

[ins] In [4]: print(type(df.columns))
         ...:
         ...: df.columns = ['A', 'B', 'C', 'D', 'EEEE', 'e']
<class 'pandas.core.indexes.base.Index'>

Now the columns are not just a list of strings, but rather an Index, so under the hood the DataFrame will do some work to ensure you do the right thing here.

[ins] In [5]: try:
         ...:     df.columns = ['a', 'b']
         ...: except ValueError as ve:
         ...:     print(ve)
         ...:
Length mismatch: Expected axis has 6 elements, new values have 2 elements

Now, having to set the full column list to rename just one column is not convenient, so there are other ways. First, you can use the rename method. The method takes a mapping of old to new column names, so you can rename as many as you wish. Remember, axis 0 or “index” is the primary index of the DataFrame (aka the rows), and axis 1 or “columns” is for the columns. Note that the default axis for rename is the index, so you’ll need to pass this argument to rename columns.

df.rename({'A': 'aaa', 'B': 'bbb', 'EEE': 'EE'}, axis="columns")
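
As a quick check of how the axis argument behaves, here is a small sketch on a hypothetical two-column frame (`demo` and `mapping` are illustrative names). It compares passing the mapping with axis="columns", passing it through the columns keyword, and passing it with no axis at all.

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame([[1, 2]], columns=['A', 'B'])
mapping = {'A': 'aaa', 'B': 'bbb'}

# These two calls are equivalent ways to target the columns.
by_axis = demo.rename(mapping, axis="columns")
by_keyword = demo.rename(columns=mapping)
print(list(by_axis.columns) == list(by_keyword.columns))

# Without an axis, the mapping is applied to the row index, where nothing
# matches 'A' or 'B', so the column names come back unchanged.
no_axis = demo.rename(mapping)
print(list(no_axis.columns))
```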

Note that by default rename doesn’t complain about mapping keys without a match (‘EEE’ is not a column in this example, but ‘EEEE’ is). You can force it to raise an error by passing in errors='raise'. Also, this method returns a new DataFrame with the changes applied, so like many DataFrame methods, you need to pass inplace=True if you want the change to persist in your DataFrame. Or you can reassign the result to the same variable.

df.rename({'A': 'aaa', 'B': 'bbb', 'EEE': 'EE'}, axis=1, inplace=True)
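
To see the difference between the default behavior and errors='raise', here is a minimal sketch on a hypothetical frame (`demo` is an illustrative name) that has an ‘EEEE’ column but no ‘EEE’ column.

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame([[1, 2]], columns=['A', 'EEEE'])

# Default: the unmatched 'EEE' key is silently skipped.
renamed = demo.rename({'A': 'aaa', 'EEE': 'EE'}, axis="columns")
print(list(renamed.columns))

# With errors='raise', the unmatched key produces a KeyError instead.
try:
    demo.rename({'A': 'aaa', 'EEE': 'EE'}, axis="columns", errors='raise')
except KeyError as ke:
    print(ke)
```

Raising on unmatched keys can be a useful guard in scripts, where a silently skipped rename is often a typo.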

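You can also change the columns using the set_axis method, with the axis set to 1 or “columns”. Like assigning to df.columns, it takes the complete list of new labels. In older versions of pandas, inplace=True would update the DataFrame in place (it was the default before 1.0 and defaults to False in versions 1.0+). The inplace argument was later deprecated and then removed from set_axis in pandas 2.0, so in current versions you reassign the result instead.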

df.set_axis(['A', 'B', 'C', 'D', 'E', 'e'], axis="columns")
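
Here is a small sketch of the reassignment pattern with set_axis, assuming pandas 1.0 or later, where the call does not modify the original by default. The frame and its labels (`demo`, `relabeled`) are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame([[1, 2, 3]], columns=['x', 'y', 'z'])

# set_axis returns a new DataFrame; demo itself keeps its old labels.
relabeled = demo.set_axis(['A', 'B', 'C'], axis="columns")
print(list(demo.columns))
print(list(relabeled.columns))

# Reassign to the same name to make the change stick.
demo = demo.set_axis(['A', 'B', 'C'], axis="columns")
```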

The rename method will also take a function. If you pass in the function (or dictionary) as the index or columns parameter, it will apply to that axis. This lets you do generic column name cleanup easily, such as removing leading and trailing whitespace like this:

df.columns = ['A  ', 'B ', 'C  ', 'D ', 'E ', 'e']
df.rename(columns=lambda x: x.strip(), inplace=True)
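
The same idea extends to a named function when the cleanup has more than one step. The following is a sketch only: the `clean` helper and the messy column names are hypothetical, chosen to show stripping, lowercasing, and replacing inner spaces in one pass.

```python
import pandas as pd

# Hypothetical frame with messy column names, for illustration only.
demo = pd.DataFrame([[1, 2, 3]], columns=[' First Name', 'Last Name ', 'AGE'])

def clean(name):
    """Strip whitespace, lowercase, and replace inner spaces with underscores."""
    return name.strip().lower().replace(' ', '_')

# rename calls clean once per column label and returns a new DataFrame.
demo = demo.rename(columns=clean)
print(list(demo.columns))
```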

I’ll also mention that one of the primary reasons for not using inplace=True is method chaining during DataFrame creation and initial setup. Often, you’ll end up doing something like this (contrived, I know).

df = pd.DataFrame(np.random.rand(2, 5), columns=np.random.rand(5)).rename(columns=lambda x: str(x)[0:5])
df

Which you’ll hopefully agree is much better than this.

df = pd.DataFrame(np.random.rand(2, 5), columns=np.random.rand(5))
df.columns = [str(x)[0:5] for x in df.columns]
df
