It's very common when dealing with time series data to end up with duplicates. This can happen for a variety of reasons, and I've encountered it more than once, trying different approaches to eliminate the duplicate values each time. There's a gem of a solution on Stack Overflow, and I thought it would be helpful to walk through the possible solutions to this issue.
To keep things simple, I'll just work with a Series of floating point data. This could be anything, but we could pretend it's something that's manually maintained, like an earnings estimate for a stock, a temperature reading, or sales for a store on a given date.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> items = pd.Series(np.random.random_sample(10) * 100,
...                   pd.date_range('2020-01-01', periods=10))
>>>
>>> items
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    38.614841
2020-01-05    39.274465
2020-01-06     6.886290
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Freq: D, dtype: float64
At this point, we have 10 periods of data, and the index (a DatetimeIndex with 10 days) contains only unique values. But let's say that in our data, corrected values appear in the same source file. I'll do something a bit contrived here and concatenate two Series that share some of the same dates, but in real life you can imagine any number of ways that data with duplicated time stamps will show up in your sources.
>>> corrected = pd.Series(np.random.random_sample(3) * 75,
...                       pd.date_range('2020-01-04', periods=3))
>>> combined = pd.concat([items, corrected])
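As a quick sanity check, concat stacks the rows and happily keeps both entries for a repeated timestamp rather than merging them. Here's a minimal self-contained sketch (the values are invented for illustration, unlike the random data above):

```python
import pandas as pd

# Two tiny series that share the timestamp 2020-01-02
first = pd.Series([1.0, 2.0], index=pd.to_datetime(['2020-01-01', '2020-01-02']))
second = pd.Series([20.0], index=pd.to_datetime(['2020-01-02']))

both = pd.concat([first, second])

# concat simply stacks the rows: the shared label now appears twice
print(len(both))             # 3 rows, not 2
print(both.index.is_unique)  # False
```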
Now, how do we get rid of this duplicated data? Let’s say that we want to only keep the most recent data in our file, assuming that it was a correction or updated value that we prefer to use. Instead of going right to the accepted solution on Stack Overflow, I’m going to work through the pandas documentation to see what the possible solutions are, and hopefully end up in the same place!
First, let's see if we can answer the question of whether our data has duplicate items in the index. In the pandas docs, we see a few promising candidates, including a duplicated method and a has_duplicates property. Let's see if those report what we expect.
>>> combined.index.has_duplicates
True
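If you prefer, the index also exposes an is_unique property, which is simply the complement of has_duplicates. A tiny sketch with made-up dates:

```python
import pandas as pd

# An index where 2020-01-02 appears twice
idx = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-02'])

# has_duplicates and is_unique are complementary properties
print(idx.has_duplicates)  # True
print(idx.is_unique)       # False
```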
Now let's look at the methods available. duplicated returns an array of boolean values, where True indicates a duplicate. You can use the keep argument to keep either the first (the default) or the last occurrence of each value in your index. drop_duplicates, on the other hand, returns an Index with the duplicates already removed, and accepts the same keep argument with the same meaning.
>>> combined.index.duplicated(keep='last')
array([False, False, False,  True,  True,  True, False, False, False,
       False, False, False, False])
>>> combined.index.drop_duplicates(keep='last')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-07',
               '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq=None)
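To see exactly how keep changes which occurrence gets flagged, here is a minimal sketch on a three-label index (dates invented for the example):

```python
import pandas as pd

# 2020-01-02 appears twice
idx = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-02'])

# keep='first' flags every occurrence after the first as a duplicate;
# keep='last' flags every occurrence before the last instead
flag_first = idx.duplicated(keep='first')  # [False, False, True]
flag_last = idx.duplicated(keep='last')    # [False, True, False]
print(flag_first, flag_last)
```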
OK, so what do we do now with these two options? The boolean array can be used to pick out the values we want to keep, but the True values mark the ones we want to drop. That's easy to fix: just invert it with ~. The inverted mask selects the values we want out of the array, which gets us to a good solution. We also need to sort the index, since the result is not in chronological order.
>>> ~combined.index.duplicated(keep='last')
array([ True,  True,  True, False, False, False,  True,  True,  True,
        True,  True,  True,  True])
>>> combined[~combined.index.duplicated(keep='last')].sort_index()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
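If this comes up often, the boolean-mask idiom can be wrapped in a small helper. dedupe_last here is a hypothetical name of my own, not a pandas function, and the values are invented for the example:

```python
import pandas as pd

def dedupe_last(s: pd.Series) -> pd.Series:
    """Keep the last value for each duplicated index label, sorted by index."""
    return s[~s.index.duplicated(keep='last')].sort_index()

# Tiny example: the later 2020-01-02 value (a "correction" of 20.0) wins
s = pd.Series([1.0, 2.0, 20.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02']))
cleaned = dedupe_last(s)
print(cleaned)
```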
Now if we want to use the second method, drop_duplicates, we need to find a way to use it to grab the values we want to keep out of our Series. This is a bit more complicated. First, we can use the reset_index method, which is a handy way to take the index (in our case a DatetimeIndex) and turn it into a column on a DataFrame, with a new regular, non-repeating index standing in temporarily. Our Series now looks a little different, but the new column can be used to remove duplicates.
>>> combined.reset_index()
        index          0
0  2020-01-01  65.685092
1  2020-01-02  15.376536
2  2020-01-03  89.002061
3  2020-01-04  38.614841
4  2020-01-05  39.274465
5  2020-01-06   6.886290
6  2020-01-07  96.453986
7  2020-01-08  56.488125
8  2020-01-09  61.667168
9  2020-01-10  18.023665
10 2020-01-04  28.990089
11 2020-01-05  22.371910
12 2020-01-06  21.548301
Now we can use drop_duplicates, but this time the DataFrame version of the method, which has a subset argument that can be used to consider only a certain column (our new 'index' column) when looking for duplicates to drop. Since this is now a DataFrame and not a Series, we then restore our index column using set_index and select column 0. This gives us the same result as the earlier method, but in a much more roundabout way. I also find it a lot more confusing.
>>> combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')[0].sort_index()
index
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
Name: 0, dtype: float64
One other way to do this is to use groupby and a grouping function (in this case last) to select the values we want. This method gives us sorted output and also looks simple.
>>> combined.groupby(combined.index).last()
2020-01-01    65.685092
2020-01-02    15.376536
2020-01-03    89.002061
2020-01-04    28.990089
2020-01-05    22.371910
2020-01-06    21.548301
2020-01-07    96.453986
2020-01-08    56.488125
2020-01-09    61.667168
2020-01-10    18.023665
dtype: float64
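One nice property of the groupby approach is that last is just one choice of aggregation: if "last one wins" isn't the right rule for your data, you can swap in first, mean, and so on. A small sketch with invented values:

```python
import pandas as pd

# 2020-01-02 appears twice, with values 10.0 and 20.0
s = pd.Series([2.0, 10.0, 20.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02']))

by_label = s.groupby(s.index)
print(by_label.last())   # keeps 20.0 for 2020-01-02
print(by_label.first())  # keeps 10.0 instead
print(by_label.mean())   # averages the duplicates: 15.0
```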
What's the best way to do this? As in the accepted answer on Stack Overflow, I prefer the first method for readability, but the last is also pretty simple. One good argument for choosing the first method is speed.
%timeit combined[~combined.index.duplicated(keep='last')].sort_index()
282 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.reset_index().drop_duplicates(subset='index', keep='last').set_index('index').sort_index()
1.56 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit combined.groupby(combined.index).last()
578 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Well, after digging through all that, I hope you understand a bit more about how to remove duplicate items from the index of a Series or DataFrame, and why some methods might be a better choice than others.