Piyush BansalPiyush Bansal

Python generators and iterables

Last week, I had written a code to query a mongodb database for some tweets that were saved in it. I was working towards a project that dealt with extracting semantic information from hashtags like #IluvJKRowling and #Penisland. This is being done via a new technique that I devised and implemented that I call - Variable Length Window Sliding Hashtag Segmentation.

Anyway,the code to fetch tweets from the database looked like this -

class MongoConnect:
  def __init__(self):
    self.clients = MongoClient(mongo_server, 27017)
    self.db = self.clients.TweetStream
    self.posts = self.db.Tweets

  def find(self):
    t=self.posts.find()
    return t

  def disconnect(self):
    self.clients.close()

class MakeData:
  def __init__(self):
    self.db = MongoConnect()
    self.tweets = self.db.find()

  def yield_tweets(self):
    for tweet in self.tweets:
      yield tweet

I gave my code to someone who also wanted to work on tweets (for some sort of tweet clustering based on named entities) , and hence needed the tweet database. The question that I got asked was -

What is this yield keyword doing here ?

I realised that even though many people use python regularly, but they tend to ignore a very awesome feature of python - generators.

This post shows some of the functionalities of generators and how they can be used in Python control flow.

Generator expressions

Generators can be created with generator expressions. A generator expression is a bit like a list comprehension. List Comprehension uses square brackets []. In Python…

>>> [x*x for x in range(10) if x % 2  ]
[1, 9, 25, 49, 81]

A generator expression is a shortcut that allows generators to be created easily with a similar syntax - this time it’s using parentheses ().

>>> (x*x for x in range(10) if x % 2)
<generator object <genexpr> at 0x7f1f5b1f32d0>

Deep inside their heart, generators are iterators

Generators “provide a convenient way to implement the iterator protocol”.

In Python, an iterator provides two key functions, __iter__ and next, so a generator itself must provide these two functions:

>>> gen = (x*x for x in range(10) if x % 2)
>>> gen.__iter__
<generator object <genexpr> at 0x293c3c0>

__iter__ is there and returns the generator, now for next

>>> gen.next()
1
>>> gen.next()
9
>>> gen.next()
25
>>> gen.next()
49
>>> gen.next()
81
>>> gen.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  StopIteration

Oh, snap! What now ?

If you look carefully, our gen generator shouldn’t have any next(), because it stops when x == 9. In order to indicate that we’ve exhausted all the elments, the generator gen raises an exception - StopIteration. Now this is what all iterators do as specified by the iterator protocol.

Building a generator with yield

Although it’s not clear from the example above, a generator is able to relinquish control and return a value - while saving its state. It then allows the control to pass back to the structure that called it, until it’s called again, picking up where it left off.

This allows for loops over sets of values to be programmed, without the full list of values being calculated first. A generator can be used so that next is called before each iteration required.

In this way, only the values required for each iteration need to be computed.

Now, coming back to the question -

What does the yield keyword do ?

The yield keyword - simple example

Adding yield to a function allows for generators to be constructed ‘manually’. In a very loose sense, use it where you’d use return in a function, and would still want the method to continue sending values after returning. The benefit of using generators is that they don’t use a lot of main memory. Python is a dynamic language like Ruby, and all the loops or loop-like constructs in it are evaluated first and then the iteration is done on iterables. But consider a scenario where you would like to work with 50 gigs of tweets, but need only one tweet at a time. What would you do ? Obviously you cannot keep those 50 gigs worth of tweets in a teeny tiny python list.

A wise man would use yield, like I did (Seriously, I am a wise man) in the example we began with.

Consider another simple usage of yield keyword -

def hola():
    yield 'hello'
    yield 'hi'

Now we can make an instance of the generator.

>>> gloria_says = hola()
>>> gloria_says
<generator object hola at 0x31d0960>

And we can ask for next value.

>>> gloria.next()
'hello'

Now when we call next again, our generator continues from the state of the last yield.

>>> gloria.next()
'hi'

So you see how different values can be returned, one after the other.

And after that second thing, the generator now raises a StopIteration, since it has no further values to return.

Since a generator implements the iterator protocol, it can be used in a for statement and therefore in a list comprehension. This makes for a convenient way to check the values of a limited generator like this one.

>>> [x for x in hola()]
['hello', 'hi']

More complex example with yield

So let’s write Fibonacci as a generator. This is a very slow implementation of fibonacci, it can be done in O(logn) using something called as Matrix Exponentiation and Fast Doubling.

def fib():
    f0, f1 = 0, 1
    while True:
        yield f0
        f0, f1 = f1, f0+f1


>>> fibgen = fib()
>>> for x in range(9):
    print fibgen.next()

    0
    1
    1
    2
    3
    5
    8
    13
    21
>>>

When do you think it would raise a StopIteration exception ? The answer is never. Do you think your lists would have kept infinite fibonacci numbers in them ? I don’t think so.

Comments