
Iterators and iterables clarified

So what exactly is a Python iterator, and how is that different from an iterable?

An iterable is an object that implements a method __iter__(), which, when called, returns an iterator. The __iter__() method can be called by the iter() function, and is also called behind the scenes in a for loop.

An iterator is a specific kind of iterable: one that also implements a next() method. Each call to next() returns a value until the iterator runs out of values, at which point it raises a StopIteration exception (and it raises StopIteration on every call thereafter). As a general rule, iterators also return self when __iter__() is called. (In Python 3 the method is spelled __next__(), and it is usually called through the built-in next() function; this post uses the Python 2 spelling.)
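To make the protocol concrete, here is a minimal sketch of an iterator, written for Python 3. The Countdown class is a made-up example, not something from the standard library, and the next alias is only there so the same class would also satisfy the Python 2 spelling.

```python
class Countdown:
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        # Python 3 spelling of the method; Python 2 looks for next().
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    next = __next__  # alias for the Python 2 spelling


countdown = Countdown(3)
first = next(countdown)   # 3
rest = list(countdown)    # [2, 1]

# Once exhausted, the iterator keeps raising StopIteration.
try:
    next(countdown)
    exhausted = False
except StopIteration:
    exhausted = True
```

Note that the object carries its own position (self.current), which is exactly the state an iterator is responsible for.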

Why does this matter?

The important thing to note here is that iterables do not have to perform their own iteration. To take a very simple case, list objects are iterables but not iterators. That is, they define an __iter__() method, but not a next() method. Rather than returning the list itself, __iter__() returns a listiterator object.

>>> names = ['Tom', 'Dick', 'Muhammad']
>>> iter(names)
<listiterator object at 0x...>

This means that the list doesn't have to store state information about the iteration process. A list doesn't need to remember that you just looked at the second object, and that the next value you should see is therefore 'Muhammad'. In fact, it can't know this, because a list might be in use in two iterations at once.
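A quick way to convince yourself of this is to ask the same list for two iterators and advance them separately. This is only a sketch (the variable names are mine), using the built-in next() function, which works in Python 2.6+ and Python 3:

```python
names = ['Tom', 'Dick', 'Muhammad']

first_iter = iter(names)
second_iter = iter(names)

# Each call to iter() hands back a new iterator, never the list itself.
is_list_itself = first_iter is names            # False
are_same_iterator = first_iter is second_iter   # False

# Advancing one iterator leaves the other untouched.
a = next(first_iter)    # 'Tom'
b = next(first_iter)    # 'Dick'
c = next(second_iter)   # 'Tom'
```

All of the position tracking lives in the two iterator objects; the list itself is unchanged.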

Looping over an iterable twice

What if you do the following?

>>> def get_pairs(seq):
...     for x in seq:
...         for y in seq:
...             if x is not y:
...                 print(x, y)

The list would have to store its index in the x loop, but then it would have to store a different index in the y loop. It's far easier to create a separate iterator for each loop, which only has to keep track of its own position. To make this clear, if verbose, the following code does approximately the same thing:

>>> seq = ['Tom', 'Dick', 'Muhammad']
>>> iterator_x = iter(seq)
>>> iterator_y = iter(seq) # First pass
>>> x = next(iterator_x) # x == 'Tom'
>>> y = next(iterator_y) # y == 'Tom'
>>> if x is not y: # x is y: skip
...    print(x, y)
>>> y = next(iterator_y) # y == 'Dick'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Tom Dick"
>>> y = next(iterator_y) # y == 'Muhammad'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Tom Muhammad"
>>> y = next(iterator_y) # raises StopIteration: the inner loop ends
>>> x = next(iterator_x) # x == 'Dick'
>>> iterator_y = iter(seq) # Restart the inner loop
>>> y = next(iterator_y) # y == 'Tom'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Dick Tom"
>>> y = next(iterator_y) # y == 'Dick'
>>> if x is not y: # x is y: skip
...    print(x, y)
>>> y = next(iterator_y) # y == 'Muhammad'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Dick Muhammad"
>>> y = next(iterator_y) # raises StopIteration: the inner loop ends
>>> x = next(iterator_x) # x == 'Muhammad'
>>> iterator_y = iter(seq) # Restart the inner loop
>>> y = next(iterator_y) # y == 'Tom'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Muhammad Tom"
>>> y = next(iterator_y) # y == 'Dick'
>>> if x is not y: # x is not y
...    print(x, y) # prints "Muhammad Dick"
>>> y = next(iterator_y) # y == 'Muhammad'
>>> if x is not y: # x is y: skip
...    print(x, y)
>>> y = next(iterator_y) # raises StopIteration: the inner loop ends
>>> x = next(iterator_x) # raises StopIteration: the outer loop ends

Note that each loop has its own iterator. This is not the case with, for instance, file objects. File objects implement the iterator protocol directly, which is to say, they have both an __iter__() method and a next() method. As a result, you can only use a file object in one loop at a time.
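You can check this difference directly by comparing what iter() returns for a file and for a list. The sketch below is Python 3 and writes a throwaway names.txt into a temporary directory so that it is self-contained; the variable names are just for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'names.txt')
    with open(path, 'w') as out:
        out.write('Tom\nDick\nMuhammad\n')

    with open(path) as f:
        file_is_own_iterator = iter(f) is f   # True
        first_line = next(f)                  # 'Tom\n'
        # A "second" iterator is the same object, so it carries on
        # from where the first one left off.
        second_line = next(iter(f))           # 'Dick\n'

names = ['Tom', 'Dick', 'Muhammad']
list_is_own_iterator = iter(names) is names   # False
```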

Looping over an iterator twice

If we try to run the same get_pairs() function with file objects, it doesn't give us the expected results.

File 'names.txt':

Tom
Dick
Muhammad

Result of get_pairs(open('names.txt')):

Tom
 Dick

Tom
 Muhammad

So two things have happened here:

  1. We're getting newlines from each line of our file.
  2. The only value being used in the x loop is 'Tom'.

The reason for (1) is pretty obvious and not particularly interesting (file lines end with newline characters), so let's look a little closer at (2). When you enter the outer loop, Python calls f.__iter__(), which, for an iterator like a file object, returns the file object itself, not a copy. It then calls f.next() on the iterator, and stores the value of that ('Tom\n') in x.

It then enters the inner loop, and calls f.__iter__() again, which returns the very same file object. Now we have two loops working on one and the same object. So when the inner loop calls f.next(), it immediately returns 'Dick\n', because *it is the same iterator* which just returned 'Tom\n' in the outer loop. The next time through, it returns 'Muhammad\n', and finally, it raises a StopIteration because it has exhausted the file, and returns control to the outer loop.

We'd like the outer loop to loop over 'Dick\n' and 'Muhammad\n' now, but remember that it is operating on the same object as the inner loop was, so when it calls f.next(), it just raises another StopIteration instead, and exits the outer loop.
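The behaviour described above can be reproduced with a variant of get_pairs() that collects the pairs instead of printing them. This is a sketch under a few assumptions: collect_pairs() is my own helper name, the code is Python 3, and it creates its own names.txt in a temporary directory. The second call shows one common workaround, which is to copy the lines into a list first so that each loop gets its own iterator.

```python
import os
import tempfile


def collect_pairs(seq):
    """Like get_pairs(), but returns the pairs instead of printing them."""
    pairs = []
    for x in seq:
        for y in seq:
            if x is not y:
                pairs.append((x, y))
    return pairs


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'names.txt')
    with open(path, 'w') as out:
        out.write('Tom\nDick\nMuhammad\n')

    with open(path) as f:
        # One shared iterator: the inner loop uses up the whole file.
        from_file = collect_pairs(f)

    with open(path) as f:
        # Copy the lines into a list, which gives each loop its own iterator.
        from_list = collect_pairs([line.strip() for line in f])
```

With the file, only the two pairs starting with 'Tom\n' come back; with the list, all six pairs do.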

Restarting an iterable

Another benefit of creating an iterable which does not perform its own iteration is that it can be restarted. Once an iterator raises a StopIteration exception, it is required to raise a StopIteration every time next() is called from then on.

An iterable, on the other hand, doesn't implement next() itself, so it never raises a StopIteration. If you try to start a new loop with your iterable, you get a new, fresh iterator, which isn't stuck in StopIteration mode.

For instance, with an iterable:

>>> names = ['Tom', 'Dick', 'Muhammad']
>>> for x in names:
...     print(x)
Tom
Dick
Muhammad
>>> for x in names: # Looping over a new iterator.
...     print(x)
Tom
Dick
Muhammad

And with an iterator:

>>> f = open('names.txt')
>>> for x in f:
...     print(x.strip())
Tom
Dick
Muhammad
>>> for x in f:
...     print(x.strip()) # Prints nothing: f is already exhausted

An iterable that hands out a fresh iterator each time can be looped through over and over again; an iterator, only once.
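The same contrast can be shown without a file, by looping twice over a list and twice over a single iterator taken from it. This is a small Python 3 sketch with variable names of my own choosing:

```python
names = ['Tom', 'Dick', 'Muhammad']

# A list is an iterable: every loop gets a fresh iterator.
first_pass = [name for name in names]
second_pass = [name for name in names]

# An iterator can be consumed only once.
name_iter = iter(names)
first_iter_pass = [name for name in name_iter]
second_iter_pass = [name for name in name_iter]   # already exhausted

# Once exhausted, it keeps raising StopIteration on every call.
stop_count = 0
for _ in range(2):
    try:
        next(name_iter)
    except StopIteration:
        stop_count += 1
```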

How can I take advantage of this?

These very different usage patterns are simply implemented by returning different things from the __iter__() method. If __iter__() returns the object itself, you also need to implement next(), and have it raise StopIteration once it is exhausted. If __iter__() returns a new, separate iterator object on each call, the original object can be looped over again and again.

Note that you cannot just write an iterable and expect it to work on its own. It has to return a legitimate iterator, which has both __iter__() and next() methods. Thus writing an iterable is a little more work, but the added flexibility is often worth it.
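As a rough illustration of that extra work, here is one way the two pieces might fit together in Python 3. Both NameList and NameListIterator are hypothetical classes invented for this sketch: the iterable holds the data, and each iterator it hands out tracks the position for a single loop.

```python
class NameListIterator:
    """Hypothetical iterator: tracks the position for one loop."""

    def __init__(self, names):
        self._names = names
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):  # spelled next() in Python 2
        if self._index >= len(self._names):
            raise StopIteration
        value = self._names[self._index]
        self._index += 1
        return value


class NameList:
    """Hypothetical iterable: holds the data, keeps no loop state."""

    def __init__(self, *names):
        self._names = list(names)

    def __iter__(self):
        # A new iterator per call, so loops can nest and restart.
        return NameListIterator(self._names)


guests = NameList('Tom', 'Dick', 'Muhammad')

# Nested loops work, because each loop has its own iterator.
pairs = [(x, y) for x in guests for y in guests if x is not y]

# And the iterable can be looped over again afterwards.
second_run = list(guests)
```

Because NameList never returns itself from __iter__(), it behaves like the list in the earlier examples rather than like the file object.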

See my follow-up post: Tricks with Iterators. In it I discuss various ways you can take advantage of iterators and iterables.
