Removing duplicate data in Pandas

It’s very common to end up with duplicate data when dealing with time series. This can happen for a variety of reasons, and I’ve encountered it more than once, trying different approaches to eliminate the duplicate values each time. There’s a gem of a solution on Stack Overflow, and I thought it would be helpful to walk through the possible solutions to this issue.

To keep things simple, I’ll just work with a Series of floating point data. This could be anything, but we can pretend it’s something that’s manually maintained, like an earnings estimate for a stock, a temperature reading, or sales for a store on a given date.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> items = pd.Series(np.random.random_sample(10) * 100, pd.date_range('2020-01-01', periods=10))
>>>
>>> items
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    38.614841
2020-01-05    39.274465
2020-01-06     6.886290
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Freq: D, dtype: float64

At this point, we have 10 periods of data, and the index (a DatetimeIndex with 10 days) has all unique values. But let’s say that corrected data later shows up in the same source file. I’ll do something a bit contrived here and concatenate two Series that share some of the same dates, but in real life you can imagine any number of ways that duplicated data for the same timestamp could show up in your sources.

>>> corrected = pd.Series(np.random.random_sample(3) * 75, pd.date_range('2020-01-04', periods=3))
>>> combined = pd.concat([items, corrected])
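
If we take a look at combined, the corrected values are simply appended after the original data, so three dates now appear twice in the index (the values here follow from the two Series above):

>>> combined
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    38.614841
2020-01-05    39.274465
2020-01-06     6.886290
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
dtype: float64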

Now, how do we get rid of this duplicated data? Let’s say that we want to keep only the most recent data in our file, assuming the later values are corrections or updates that we prefer to use. Instead of going right to the accepted solution on Stack Overflow, I’m going to work through the pandas documentation to see what the possible solutions are, and hopefully end up in the same place!

First, let’s see if we can answer the question of whether our data has duplicate items in the index. In the pandas docs, we see a few promising options, including a duplicated method and a has_duplicates property. Let’s see if those report what we expect.

>>> combined.index.has_duplicates
True

Now the methods to look at are duplicated and drop_duplicates. duplicated returns an array of boolean values, where True marks a duplicate. You can use the keep argument to keep either the first (the default) or the last occurrence of each value in your index. drop_duplicates returns an Index with the duplicates already removed, and takes the same keep argument with the same meaning.

>>> combined.index.duplicated(keep='last')
array([False, False, False,  True,  True,  True, False, False, False,
       False, False, False, False])
>>> combined.index.drop_duplicates(keep='last')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-07',
               '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq=None)
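
For comparison, the default keep='first' flags the later occurrences (the corrected values at the end) as the duplicates instead:

>>> combined.index.duplicated()
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True])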

Ok, so what do we do now with these two options? The boolean array can be used to pick out the values that we want to keep, but the True values mark the ones we want to drop. That’s easy to deal with: just invert the array with ~. The inverted array then selects only the rows we want, and gets us to a good solution. We also need to sort the index, since after dropping the duplicates it is no longer in chronological order.

>>> ~combined.index.duplicated(keep='last')
array([ True,  True,  True, False, False, False,  True,  True,  True,
        True,  True,  True,  True])
>>> combined[~combined.index.duplicated(keep='last')].sort_index()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
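
As an aside, the same boolean-mask idiom works if your data lives in a DataFrame rather than a Series, since the mask is built from the index alone. A minimal sketch, using a hypothetical 'price' column made from our data:

>>> df = combined.to_frame(name='price')
>>> deduped = df[~df.index.duplicated(keep='last')].sort_index()
>>> deduped.index.has_duplicates
False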

Now, if we want to use the second method, drop_duplicates, we need to find a way to use it to grab the values we want to keep out of our Series. This is a bit more complicated, because calling drop_duplicates directly on the Series compares the values, not the index labels. First, we can use the reset_index method, which is a handy way to momentarily turn the index (in our case a DatetimeIndex) into a column of a DataFrame with a new, regular, non-repeating index. Our data now looks a little different, but the new ‘index’ column can be used to remove the duplicates.

>>> combined.reset_index()
        index          0
0  2020-01-01  65.685092
1  2020-01-02  15.376536
2  2020-01-03  89.002061
3  2020-01-04  38.614841
4  2020-01-05  39.274465
5  2020-01-06   6.886290
6  2020-01-07  96.453986
7  2020-01-08  56.488125
8  2020-01-09  61.667168
9  2020-01-10  18.023665
10 2020-01-04  28.990089
11 2020-01-05  22.371910
12 2020-01-06  21.548301

Now we can use drop_duplicates, but we’ll use the DataFrame version of the method, which has a subset argument that can be used to consider only a certain column (our new ‘index’ column) when looking for duplicates to drop. Then, since this is now a DataFrame and not a Series, we set the index back to our ‘index’ column using set_index and select column 0 to get a Series again. This gives us the same result as the earlier method, but in a much more roundabout way, and I find it a lot more confusing.

>>> combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()
index
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Name: 0, dtype: float64

One other way to do this is to use groupby with an aggregation function (in this case last) to pick the value we want from each group of duplicated dates. This method gives us sorted output and also reads simply.

>>> combined.groupby(combined.index).last()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
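
If you prefer, grouping by the index level directly is an equivalent spelling of the same groupby:

>>> result = combined.groupby(level=0).last()  # same result as grouping by combined.index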

What’s the best way to do this? Like the answer on Stack Overflow, I prefer the first method for readability, but the groupby method is also pretty simple. One good argument for choosing the first method is speed.

%timeit combined[~combined.index.duplicated(keep='last')].sort_index()
282 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()
1.56 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.groupby(combined.index).last()
578 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Well, after digging through all that, I hope you understand a bit more about how to remove duplicate items from a Series or DataFrame, and why some methods might be better choices than others.

Have anything to say about this topic?