Drop duplicates in order

Let’s say you have a list containing all the URLs extracted from a web page and you want to get rid of duplicate URLs.

The most common way of achieving that is probably to build a set from the list, since doing so automatically drops the duplicates. Something like:

>>> urls = [
    'http://api.example.com/b',
    'http://api.example.com/a',
    'http://api.example.com/c',
    'http://api.example.com/b'
]
>>> set(urls)
{'http://api.example.com/a',
 'http://api.example.com/b',
 'http://api.example.com/c'}

The problem is that sets are unordered, so we just lost the original order of the list.

A good way to maintain the original order of the elements after removing the duplicates is by using this trick with collections.OrderedDict:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(urls).keys())
['http://api.example.com/b',
 'http://api.example.com/a',
 'http://api.example.com/c']

Cool, huh? Now let’s dig into the details to understand what the code above does.

OrderedDict is like a traditional Python dict with a (not so) slight difference: OrderedDict keeps the elements’ insertion order internally. This way, when we iterate over such an object, it will return its elements in the order in which they’ve been inserted.
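A side note for readers on newer interpreters: since Python 3.7 the built-in dict is also guaranteed to preserve insertion order, so the same trick should work without the import. The snippet below is a minimal sketch under that assumption (Python 3.7 or newer); on older versions, OrderedDict remains the safe choice.

```python
# Assumes Python 3.7+, where plain dicts preserve insertion order.
urls = [
    'http://api.example.com/b',
    'http://api.example.com/a',
    'http://api.example.com/c',
    'http://api.example.com/b',
]

# dict.fromkeys() keeps one entry per key, positioned at its first occurrence.
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```

OrderedDict still has its uses, for example when the code must run on older Python versions or when you want the ordering intent to be explicit.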

Now, let’s break down the operations to understand what’s going on:

>>> odict = OrderedDict.fromkeys(urls)

The fromkeys() method creates a dictionary using the items of the iterable passed as its first argument as the keys and the second argument as the value for every key (or None if we pass nothing, as we did).

As a result we get:

>>> odict
OrderedDict([('http://api.example.com/b', None),
             ('http://api.example.com/a', None),
             ('http://api.example.com/c', None)])
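To make the role of the second argument concrete, here is a small sketch comparing the two forms of fromkeys(). The variable names (default, flagged) are just illustrative and not part of the original example.

```python
from collections import OrderedDict

urls = [
    'http://api.example.com/b',
    'http://api.example.com/a',
    'http://api.example.com/c',
    'http://api.example.com/b',
]

# Without a second argument, every key maps to None.
default = OrderedDict.fromkeys(urls)

# With a second argument, every key maps to that same value.
flagged = OrderedDict.fromkeys(urls, False)

print(default['http://api.example.com/a'])  # None
print(flagged['http://api.example.com/a'])  # False
print(len(flagged))                         # 3, the duplicate key was collapsed
```

For deduplication the values do not matter at all; only the keys are used.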

Now that we have a dictionary with the URLs as the keys, we can call the keys() method and pass the result to list() to get a list containing only the URLs:

>>> list(odict.keys())
['http://api.example.com/b',
 'http://api.example.com/a',
 'http://api.example.com/c']
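If you need this in more than one place, the steps can be wrapped in a small function. This is only a sketch: drop_duplicates is a hypothetical name, and it assumes the items are hashable, since they are used as dictionary keys. It also skips the keys() call, because iterating over a dictionary already yields its keys.

```python
from collections import OrderedDict


def drop_duplicates(items):
    """Return a list of the unique items, keeping first-seen order.

    Hypothetical helper; items must be hashable to serve as dict keys.
    """
    # Iterating over a dictionary yields its keys, so list() alone is enough.
    return list(OrderedDict.fromkeys(items))


print(drop_duplicates([3, 1, 3, 2, 1]))  # [3, 1, 2]
print(drop_duplicates('mississippi'))    # ['m', 'i', 's', 'p']
```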

As easy as that. 🙂

If you enjoyed this tip, subscribe to the blog, because I’ll be posting more content in the upcoming weeks.
