How to remove a column from a DataFrame, with some extra detail

Removing one or more columns from a pandas DataFrame is a common task, but it turns out there are a number of ways to do it. I found that this StackOverflow question, along with the solutions and discussion in it, raised a number of interesting topics. It is worth digging into the details a little.

First, what’s the “correct” way to remove a column from a DataFrame? The standard way to do this is to think in SQL and use drop.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(25).reshape((5,5)), columns=list("abcde"))

display(df)

try:
    df.drop('b')
except KeyError as ke:
    print(ke)
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
"['b'] not found in axis"

Wait, what? Why an error? That’s because the default axis that drop works with is the rows. As with many pandas methods, there’s more than one way to invoke the method (which some people find frustrating). 

Since rows are the default, you can drop them by passing the label directly, by spelling out axis=0 or axis='rows', or by naming the labels argument.

df.drop(0)                # drop a row, on axis 0 or 'rows'
df.drop(0, axis=0)        # same
df.drop(0, axis='rows')   # same
df.drop(labels=0)         # same
df.drop(labels=[0])       # same
    a   b   c   d   e
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

Again, how do we drop a column?

To drop a column instead of a row, you can either specify the axis or use the columns parameter.

df.drop('b', axis=1)         # drop a column
df.drop('b', axis='columns') # same
df.drop(columns='b')         # same
df.drop(columns=['b'])       # same
    a   c   d   e
0   0   2   3   4
1   5   7   8   9
2  10  12  13  14
3  15  17  18  19
4  20  22  23  24

There you go, that’s how you drop a column. Note that drop returns a new DataFrame and leaves the original untouched, so you have to either assign the result to a new variable, assign it back to your old variable, or pass in inplace=True to make the change permanent.

df2 = df.drop('b', axis=1)

print(df2.columns)
print(df.columns)
Index(['a', 'c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
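
To make the reassignment and inplace options concrete, here is a small sketch. The names reassigned, in_place, and result are just mine for illustration. The point is that inplace=True modifies the existing object and returns None.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# Option 1: assign the result of drop back to the same name.
reassigned = df.copy()
reassigned = reassigned.drop(columns='b')

# Option 2: modify the existing object; drop returns None in this case.
in_place = df.copy()
result = in_place.drop(columns='b', inplace=True)

print(list(reassigned.columns))  # ['a', 'c', 'd', 'e']
print(list(in_place.columns))    # ['a', 'c', 'd', 'e']
print(result)                    # None
print(list(df.columns))          # ['a', 'b', 'c', 'd', 'e'] (original untouched)
```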

It’s also worth noting that you can drop rows and columns in a single call by passing both the index and columns arguments, and each of them accepts multiple values.

df.drop(index=[0,2], columns=['b','c'])
    a   d   e
1   5   8   9
3  15  18  19
4  20  23  24
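
Since drop raised a KeyError earlier when a label wasn't found, it may help to see the errors parameter as well. This is a sketch that assumes you want to silently skip missing labels; the column name 'z' is made up and deliberately doesn't exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# By default, a label that isn't present raises a KeyError.
try:
    df.drop(columns=['b', 'z'])
    failed = False
except KeyError:
    failed = True

# errors='ignore' drops the labels that exist and skips the rest.
tolerant = df.drop(columns=['b', 'z'], errors='ignore')

print(failed)                  # True
print(list(tolerant.columns))  # ['a', 'c', 'd', 'e']
```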

If you didn’t have the drop method, you could obtain the same results through indexing. There are many ways to accomplish this, but one equivalent solution is to build boolean masks with isin, invert them, and pass them to the .loc indexer.

df.loc[~df.index.isin([0,2]), ~df.columns.isin(['b', 'c'])]
    a   d   e
1   5   8   9
3  15  18  19
4  20  23  24
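
As one more variation on the indexing approach, here is a sketch that builds the list of columns to keep with a plain list comprehension and checks that it matches what drop returns. The variable names are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# Reference result using drop.
by_drop = df.drop(index=[0, 2], columns=['b', 'c'])

# Same selection using inverted boolean masks.
by_mask = df.loc[~df.index.isin([0, 2]), ~df.columns.isin(['b', 'c'])]

# Same selection using an explicit list of the columns to keep.
keep_cols = [c for c in df.columns if c not in ('b', 'c')]
by_list = df.loc[~df.index.isin([0, 2]), keep_cols]

print(by_mask.equals(by_drop))  # True
print(by_list.equals(by_drop))  # True
```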

If none of that makes sense to you, I would suggest reading through my series on selecting and indexing in pandas, starting here.

Back to the question

Looking back at the original question though, we see there is another available technique for removing a column.

del df['a']
df
    b   c   d   e
0   1   2   3   4
1   6   7   8   9
2  11  12  13  14
3  16  17  18  19
4  21  22  23  24

Poof! It’s gone. This is like doing a drop with inplace=True.
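
Because del changes the object itself, any other name that refers to the same DataFrame sees the change too. The sketch below shows that with a small made-up frame, and also shows pop, which removes a column in place and hands it back as a Series.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
alias = df  # a second name for the same object, not a copy

del df['a']
print(list(alias.columns))  # ['b', 'c']

# pop also modifies the frame in place, but returns the removed column.
popped = df.pop('b')
print(list(df.columns))     # ['c']
print(popped.tolist())      # [3, 4]
```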

What about attribute access?

We also know that we can use attribute access to select columns of a DataFrame.

df.b
0     1
1     6
2    11
3    16
4    21
Name: b, dtype: int64

Can we delete the column this way?

del df.b
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-0dca358a6ef9> in <module>
----> 1 del df.b

AttributeError: b

We cannot. This is not an option for removing columns with the current pandas design. Is this technically impossible? How come del df['b'] works but del df.b doesn’t? Let’s dig into those details and see whether it would be possible to make the second work as well.

The first version works because in pandas, the DataFrame implements the __delitem__ method which gets invoked when you execute del df['b']. But what about del df.b, is there a way to handle that?
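
To see that dispatch directly, here is a small sketch that calls __delitem__ by hand on a made-up frame. Python looks the method up on the type, so this is roughly what del df['b'] does for you.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The DataFrame type provides __delitem__ ...
print(hasattr(pd.DataFrame, '__delitem__'))  # True

# ... and del df['b'] is roughly equivalent to this explicit call.
type(df).__delitem__(df, 'b')
print(list(df.columns))  # ['a']
```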

First, let’s make a simple class that shows how this works under the hood. Instead of being a real DataFrame, we’ll just use a dict as a container for our columns (which could really contain anything, we’re not doing any indexing here).

class StupidFrame:
    def __init__(self, columns):
        self.columns = columns
        
    def __delitem__(self, item):
        del self.columns[item]
        
    def __getitem__(self, item):
        return self.columns[item]
    
    def __setitem__(self, item, val):
        self.columns[item] = val
            
f = StupidFrame({'a': 1, 'b': 2, 'c': 3})
print("StupidFrame value for a:", f['a'])
print("StupidFrame columns: ", f.columns)
del f['b']
f.d = 4
print("StupidFrame columns: ", f.columns)
StupidFrame value for a: 1
StupidFrame columns:  {'a': 1, 'b': 2, 'c': 3}
StupidFrame columns:  {'a': 1, 'c': 3}

A couple of things to note here. First, we show that we can access the data in our StupidFrame with the index operator ([]), and use that for setting, getting, and deleting items. Second, when we assigned d to our frame, it wasn’t added to our columns because it’s just a normal instance attribute. If we want to be able to handle the columns as attributes, we have to do a little bit more work.
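
If you want to see where that d attribute ended up, you can look at the instance dictionary with vars. This sketch uses a stripped-down stand-in class (PlainFrame is my name for it, not something from pandas) so that it runs on its own.

```python
class PlainFrame:
    """Minimal stand-in: holds a dict of columns, nothing else."""

    def __init__(self, columns):
        self.columns = columns


f = PlainFrame({'a': 1, 'b': 2})
f.d = 4  # ordinary attribute assignment

# d lives next to columns in the instance dictionary, not inside columns.
print(vars(f))     # {'columns': {'a': 1, 'b': 2}, 'd': 4}
print(f.columns)   # {'a': 1, 'b': 2}
```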

So following the example from pandas (which supports attribute access of columns), we add the __getattr__ method, but we also will handle setting it with the __setattr__ method and pretend that any attribute assignment is a ‘column’. We have to update our instance dictionary (__dict__) directly to avoid an infinite recursion.
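
Before looking at the class, here is a quick sketch of the recursion problem. NaiveFrame is a hypothetical version that uses self.columns everywhere instead of going through __dict__. Setting self.columns triggers __setattr__, which reads self.columns, which isn't there yet, so __getattr__ is called, which reads self.columns again, and so on.

```python
class NaiveFrame:
    """Hypothetical broken version that never touches __dict__ directly."""

    def __init__(self, columns):
        self.columns = columns  # goes through __setattr__ below

    def __getattr__(self, item):
        # self.columns is not set yet, so this calls __getattr__ again.
        return self.columns[item]

    def __setattr__(self, item, val):
        # Reading self.columns here falls through to __getattr__.
        self.columns[item] = val


try:
    NaiveFrame({'a': 1})
    outcome = "no error"
except RecursionError:
    outcome = "RecursionError"

print(outcome)  # RecursionError
```

The class that follows sidesteps this by reading and writing __dict__['columns'] directly.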

class StupidFrameAttr:
    def __init__(self, columns):
        self.__dict__['columns'] = columns
        
    def __delitem__(self, item):
        del self.__dict__['columns'][item]
        
    def __getitem__(self, item):
        return self.__dict__['columns'][item]
    
    def __setitem__(self, item, val):
        self.__dict__['columns'][item] = val
        
    def __getattr__(self, item):
        if item in self.__dict__['columns']:
            return self.__dict__['columns'][item]
        elif item == 'columns':
            return self.__dict__[item]
        else:
            raise AttributeError(item)
    
    def __setattr__(self, item, val):
        if item != 'columns':
            self.__dict__['columns'][item] = val
        else:
            raise ValueError("Overwriting columns prohibited") 

            
f = StupidFrameAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameAttr value for a", f['a'])
print("StupidFrameAttr columns: ", f.columns)
del f['b']
print("StupidFrameAttr columns: ", f.columns)
print("StupidFrameAttr value for a", f.a)
f.d = 4
print("StupidFrameAttr columns: ", f.columns)
del f['d']
print("StupidFrameAttr columns: ", f.columns)
f.d = 5
print("StupidFrameAttr columns: ", f.columns)
del f.d
StupidFrameAttr value for a 1
StupidFrameAttr columns:  {'a': 1, 'b': 2, 'c': 3}
StupidFrameAttr columns:  {'a': 1, 'c': 3}
StupidFrameAttr value for a 1
StupidFrameAttr columns:  {'a': 1, 'c': 3, 'd': 4}
StupidFrameAttr columns:  {'a': 1, 'c': 3}
StupidFrameAttr columns:  {'a': 1, 'c': 3, 'd': 5}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-fd29f59ea01e> in <module>
     39 f.d = 5
     40 print("StupidFrameAttr columns: ", f.columns)
---> 41 del f.d

AttributeError: d

How could we handle deletion?

Everything works except deletion using attribute access. We handle setting and getting columns using both the index operator ([]) and attribute access. But what about detecting deletion? Is that possible?

One way to do this is using the __delattr__ method, which is described in the data model documentation. If you define this method in your class, it will be invoked instead of updating an instance’s attribute dictionary directly. This gives us a chance to redirect the deletion to our columns dict.

class StupidFrameDelAttr(StupidFrameAttr):
    def __delattr__(self, item):
        # trivial implementation using the data model methods
        del self.__dict__['columns'][item]

f = StupidFrameDelAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameDelAttr value for a", f['a'])
print("StupidFrameDelAttr columns: ", f.columns)
del f['b']
print("StupidFrameDelAttr columns: ", f.columns)
print("StupidFrameDelAttr value for a", f.a)
f.d = 4
print("StupidFrameDelAttr columns: ", f.columns)
del f.d 
print("StupidFrameDelAttr columns: ", f.columns)
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns:  {'a': 1, 'b': 2, 'c': 3}
StupidFrameDelAttr columns:  {'a': 1, 'c': 3}
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns:  {'a': 1, 'c': 3, 'd': 4}
StupidFrameDelAttr columns:  {'a': 1, 'c': 3}
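
One wrinkle with the trivial version is that deleting an attribute that isn't a column surfaces a KeyError from the dict rather than an AttributeError. Here is a slightly more careful sketch, written as a standalone class (CarefulFrame is my own name) so it runs by itself. It is only one possible way to do it.

```python
class CarefulFrame:
    """Sketch: attribute-style get, set, and delete for 'columns'."""

    def __init__(self, columns):
        # Write to __dict__ directly to bypass our own __setattr__.
        self.__dict__['columns'] = columns

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        try:
            return self.__dict__['columns'][item]
        except KeyError:
            raise AttributeError(item) from None

    def __setattr__(self, item, val):
        if item == 'columns':
            raise ValueError("Overwriting columns prohibited")
        self.__dict__['columns'][item] = val

    def __delattr__(self, item):
        columns = self.__dict__['columns']
        if item not in columns:
            # Report a missing column the way attribute deletion normally fails.
            raise AttributeError(item)
        del columns[item]


f = CarefulFrame({'a': 1, 'b': 2, 'c': 3})
del f.b
print(f.columns)  # {'a': 1, 'c': 3}

try:
    del f.not_a_column
    error_type = None
except AttributeError as err:
    error_type = type(err).__name__

print(error_type)  # AttributeError
```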

Now I’m not suggesting that attribute deletion for columns would be easy to add to pandas, but at least this shows how it could be possible. In the case of current pandas, deleting columns is best done using drop.

Also, it’s worth mentioning here that when you create a new column in pandas, you don’t assign it as an attribute. To better understand how to properly create a column, you can check out this article.
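
Here is a short sketch of that difference, using made-up names (doubled and tripled). Assigning a list as a new attribute leaves the columns alone, and pandas typically warns about it, which is why the warning is captured here. Assigning with the index operator creates a real column.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Attribute assignment with a new name sets a plain Python attribute.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas usually emits a UserWarning here
    df.doubled = [2, 4, 6]

print('doubled' in df.columns)  # False

# The index operator is what actually creates a column.
df['tripled'] = df['a'] * 3
print(list(df.columns))  # ['a', 'tripled']
```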

If you already knew how to drop a column in pandas, hopefully you understand a little bit more about how this works.
