
Don't waste your iterators!

There's a pattern I see fairly often in code where someone uses a function that returns a sequence of some sort, filters it, and then wants to use the first result that matches the filter. It looks something like this:

return [x for x in foo if len(x) > 4][0]

This works, and looks "pythonic" (it uses list comprehensions after all!), but it's actually a fairly slow and wasteful way to get the results we want.
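To see the waste concretely, here's a small sketch (the `noisy` helper and the sample data are made up for illustration) showing that the list comprehension visits every element, even though we only keep the first match:

```python
def noisy(x):
    # Hypothetical helper that reports each element it touches.
    print("checking", x)
    return x

items = ["hi", "hello there", "yo", "greetings", "hey"]

# The comprehension filters every element before [0] picks one out:
first = [x for x in items if len(noisy(x)) > 4][0]
# "checking" prints five times, even though the first match
# ("hello there") was already found on the second element.
```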

The actual example I saw which prompted me to post this was from a fun post by Jeff Elmore which explained creating a wu-name generator in six lines of python.

import urllib
from lxml.html import fromstring

def get_wu_name(first_name, last_name):
    """
    >>> get_wu_name("Jeff", "Elmore")
    'Ultra-Chronic Monstah'
    """

    w = urllib.urlopen("http://www.recordstore.com/wuname/wuname.pl",
                       urllib.urlencode({'fname':first_name, 'sname': last_name}))
    doc = fromstring(w.read())
    return [d for d in doc.iter('span') if d.get('class') == 'newname'][0].text

What this will do is find every span in the document, and check to see if its class is 'newname'. In order to do this, it has to scan the entire document, which may contain a significant amount of unwanted material.

We don't need a comprehensive list of matching spans. We just need one, and then we can take it and move on. With a list comprehension, we can't even ask for the first one until the whole document has been processed.

We're actually better off going through this the old-fashioned way, by using an if nested in a for-loop, and returning the result.

for d in doc.iter('span'):
    if d.get('class') == 'newname':
        return d.text

But we like our brevity, and python provides us with generator expressions, which look like list comprehensions but don't do any actual work when they get created. With them, we can ask for the first object before we even begin scanning the document, so python knows to stop processing as soon as it finds the right one. A generator also doesn't hold its results in memory; it hands them back one at a time, as they are requested. We both save the memory it would take to build up a list of spans *and* get to stop searching the moment we find a span matching our conditional.
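As a quick illustration of that laziness (not from the original post), a generator expression over an endless sequence costs nothing to build, and pulling one value only advances the scan as far as the first match:

```python
from itertools import count

# count() yields 0, 1, 2, ... forever.  A list comprehension over it
# would never finish, but the generator expression does no work at
# all until something asks it for a value.
matches = (n for n in count() if n % 7 == 0 and n > 100)

first = next(matches)  # examines only the numbers 0 through 105
```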

The bad news is that we can't just write:

return (d for d in doc.iter('span') if d.get('class') == 'newname')[0].text

If we do, we get an exception:

TypeError: 'generator' object is not subscriptable

We're trying to index into a generator, but the iterator protocol doesn't support indexing to a particular member. All you can do is start at the beginning and work through it one at a time.

The good news is that since we want the first one anyway, all we have to do is start iterating through the generator, and stop after grabbing the first item.

generator = (d for d in doc.iter('span') if d.get('class') == 'newname')
for each in generator:
    # The function returns on the first pass through the loop,
    # forestalling any further processing
    return each

But now we've lost the terseness of our list comprehension again. All we've done is move the if clause out of the for loop and into the generator. Not much of an improvement.

If you understand how generators do their job, though, you can maintain the terseness of the original list comprehension version while enjoying the improved performance of the generator version. Each time a for loop goes around, it retrieves the next value by calling the iterator's .next() method (in python 2). So rather than relying on a for loop to step through the iterator for us, we can call that method ourselves.

return (d for d in doc.iter('span') if d.get('class') == 'newname').next().text

In python 3 the method is called .__next__() instead of .next(), so this method isn't quite compatible across python versions. For python 3, we could use:

return (d for d in doc.iter('span') if d.get('class') == 'newname').__next__().text

If cross-python compatibility is important to you, or if you would rather not muck around with dunder methods, there is a builtin next() function, added in python 2.6, which takes an iterator and calls the appropriate .next() or .__next__() method on it, returning the result. So now we can write:

return next(d for d in doc.iter('span') if d.get('class') == 'newname').text

We've gotten the results we wanted quickly and efficiently, with no appreciable loss of code clarity.
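One caveat worth knowing: if nothing matches, next() raises StopIteration. The builtin also accepts a default as a second argument, which it returns instead of raising when the iterator is exhausted. A small sketch, with made-up data:

```python
words = ["hi", "yo"]

# Nothing here is longer than four characters, so the generator
# yields nothing; a bare next() call would raise StopIteration,
# but the default of None is returned instead.
first = next((w for w in words if len(w) > 4), None)
```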

Now, in this particular case, it's not a huge deal one way or the other. Our bottleneck is going to be pulling the document down from the internet, and the page is fairly short, so we won't waste much time scanning it. But there are times when this trick can save you quite a bit. Imagine scanning through a logfile that has never been placed under log rotation and holds hundreds of megabytes of data. Imagine that processing each item in the list comprehension took ten minutes. Or imagine calculating wu-names for a million users, where every hundredth of a second per name translates to nearly three more hours of running time. In any of these cases, knowing when to use iterators, and how to use them effectively, can make a big difference.
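The logfile case might look something like this (the function name and search string are made up for illustration):

```python
def first_error(path):
    # A file object is itself an iterator over its lines, so this
    # stops reading the moment a matching line turns up, however
    # many megabytes of log follow it.
    with open(path) as log:
        return next((line for line in log if 'ERROR' in line), None)
```

Only as much of the file as necessary is ever read, and no list of matching lines is built up in memory along the way.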
