Sunday, October 23, 2011

The Dining Philosophers

Five silent philosophers sit at a table around a bowl of spaghetti. A fork is placed between each pair of adjacent philosophers.
Each philosopher must alternately think and eat. Eating is not limited by the amount of spaghetti left: assume an infinite supply. However, a philosopher can only eat while holding both the fork to the left and the fork to the right (an alternative problem formulation uses rice and chopsticks instead of spaghetti and forks).
Each philosopher can pick up an adjacent fork, when available, and put it down, when holding it. These are separate actions: forks must be picked up and put down one by one.
The problem is how to design a discipline of behavior (a concurrent algorithm) such that each philosopher won't starve, i.e. can forever continue to alternate between eating and thinking.

Ok, so I said in the previous post that AppEngine is a massive distributed machine. So on such a machine, we should be able to implement a solution to The Dining Philosophers Problem. How would we go about it?

Implementation of Semaphores

Firstly, we need a working Semaphore implementation. In the previous post I sketched out an approach, but since then I've built a proper working version.

Find the full Semaphore implementation here: on github

class Semaphore(polymodel.PolyModel):
    _counter = db.IntegerProperty()
    _suspendList = db.StringListProperty()

    def ConstructSemaphore(cls, aCounter):
        retval = cls()
        retval._counter = aCounter
        retval._suspendList = []
        return retval
    ConstructSemaphore = classmethod(ConstructSemaphore)
    def Wait(self, obj, *args, **kwargs):
        while True:
                lneedsRun = db.run_in_transaction(
                        obj, *args, **kwargs
                if lneedsRun:
                        obj(*args, **kwargs)
                    except Exception, ex:
            except TransactionFailedError:
                # do nothing
                logging.warning("TransactionFailedError in Wait, try again")

    def Signal(self):
        while True:
                db.run_in_transaction(_doSignal, self.key())
            except TransactionFailedError:
                # do nothing
                logging.warning("TransactionFailedError in Signal, try again")
def _doWait(aKey, aObj, *args, **kwargs):
    lneedsRun = False
    lsem = db.get(aKey)
    if not lsem:
        raise Exception("Internal: failed to retrieve semaphore in _doWait")
    if lsem._counter > 0:
        lsem._counter -= 1
        logging.debug("counter: %s" % lsem._counter)
        lneedsRun = True
        logging.debug("about to defer")
        pickled = deferred.serialize(aObj, *args, **kwargs)
        pickled = base64.encodestring(pickled)
        logging.debug("after defer, pickled=%s" % pickled)
    return lneedsRun

def _doSignal(aKey):
    lsem = db.get(aKey)
    if not lsem:
        raise Exception("Internal: failed to retrieve semaphore in _doSignal")

    if len(lsem._suspendList) > 0:
        logging.debug("about to unsuspend")
        pickled = lsem._suspendList.pop()
        pickled = base64.decodestring(pickled)
        logging.debug("pickled=%s" % pickled)

        # here I depickle the pickled call information, only to repickle inside 
        # deferred.defer. Not ideal, but cleaner given the calls available 
        # inside deferred.
            obj, args, kwds = pickle.loads(pickled)
        except Exception, e:
            raise deferred.PermanentTaskFailure(e)
        logging.debug("about to defer")
        deferred.defer(obj, _transactional=True, *args, **kwds)
        logging.debug("after defer")
        lsem._counter += 1

What I wrote last time was basically correct. I've using Nick Johnson's deferred.defer library to make the adding and running of tasks smoother (ie: hide the use of web handlers).

Some points of interest:

The _suspendList of Semaphore is a stringlist, which stores the same pickled call information that deferred.defer uses. I've made liberal use of internal functions of deferred to make this work, basically because that library is written by someone far more competent with Python than me, so why not?

Wait() takes obj, *args, **kwargs and passes them to the call of _doWait() in a transaction. obj, *args and **kwargs define the call to make once we have successfully acquired the semaphore. So, inside _doWait(), we'll either need to add obj (et al) to the _suspendList, or, if we can get the semaphore immediately, we need to call obj(). However, we can't call it inside _doWait() because it's inside a transaction - we don't want or need to be inside the context of that transaction for the call to obj().  So instead, _doWait returns a bool, which is True in the case that we've aquired the semaphore, and you'll see that Wait checks this result, and calls obj() immediately if it's True. Thus, obj() is called if necessary, but outside the transaction.

The pickled call information, created by deferred.serialize(), isn't safe to save in a stringlist (illegal characters, throws errors). So, I base64 encode it before saving it in _doWait(). You'll see in _doSignal that the pickled call information is base64 decoded.

You'll notice that inside _doSignal(), when there are waiting tasks on the _suspendList, that I dequeue one (the last one, but order is irrelevant in a Semaphore implementation), unpickle it, but I don't call it. Instead, I add a task for it using deferred.defer(). I don't call it because the task we are in has just finished doing work while holding the semaphore, which might have been lengthy. These are short lived tasks, we shouldn't do more than one user-defined thing in one task. So, instead of running this second thing (the dequeued suspended task), I reschedule it to run immediately in another task. Note also that I mark that defer as transactional; it means that if the signal transaction fails, the task wont be enqueued, which is what we want.

One last note: In case it's not obvious, the transactions combined with a reload of the Semaphore ensure that we can safely use Semaphore objects even if they are stale. So don't worry about stale Semaphores causing contention issues. This is explained more in the previous post.

Oh, and I think if you pass a queue name into Wait() in the same way you would to a regular call to deferred.defer() (ie: a parameter _queue="somequeuename"), the semaphore will use that queue instead of default, which might be handy.

Testing the Semaphores

I've got some simple test code in for giving Semaphores a run.

def SemaphoreTest1():"*****************************")"**   BEGIN SEMAPHORETEST1  **")"*****************************")
    lsem = Semaphore.ConstructSemaphore(2)
    lcount = 0
    while lcount < 20:
        deferred.defer(SemaphoreTest1EntryPoint, lsem.key(), lcount, True)
        lcount += 1
def SemaphoreTest1EntryPoint(aKey, aNum, aFirst):
    lsem = db.get(aKey)
    if not lsem:
        raise Exception("Failed to retrieve semaphore in EntryPoint1")

    if aFirst:
        # this is before we've got the semaphore"Before Wait for %s" % aNum)
        lsem.Wait(SemaphoreTest1EntryPoint, aKey, aNum, False)
        # we now have the semaphore"Begin Critsec for %s" % aNum)
        sleep(2) # stay inside critsec for 2 seconds"End Critsec for %s" % aNum)
        lsem.Signal() ("After Signal for %s" % aNum)

SemaphoreTest1() creates a new Semaphore with counter=2 (ie: allow max 2 tasks to be inside the Semaphore at any time), and schedules 20 tasks to run immediately, which all run SemaphoreTestEntryPoint() with the new Semaphore (passed by key in the aKey parameter).

SemaphoreTestEntryPoint() loads the Semaphore, then takes one of two paths. The first time through aFirst is True (we don't hold the semaphore yet); it waits on the semaphore (switching aFirst to False). The second time through, with aFirst as False, where we hold the Semaphore, it sleeps for a couple of seconds, then signals the Semaphore and exits.

The upshot of this is that the 20 tasks, all run at the same time, will contend for the Semaphore. The first two to get it will sit inside it for 2 seconds (a long time, while the other tasks keep waiting on it and suspending). Eventually these tasks holding it will signal it and exit. Each time a task signals, it'll kick off a waiting one, which in turn will get it, again taking two seconds, then exit. And do on until there are no more tasks left.

Try running this code, and fiddling with the parameters a bit. Note that you'll find the whole project on GitHub, here.

Back to Dinner

So we've got a working Semaphore. So how do we implement a solution to the dining philosophers?

Let's implement the classic deadlock algorithm first. It goes like this:

  • think until the left fork is available; when it is, pick it up;
  • think until the right fork is available; when it is, pick it up
  • eat
  • put the left fork down
  • put the right fork down
  • repeat from the start
First we need to represent a Fork. I'll use a Semaphore with counter=1, ie: while someone holds the fork, no one else may hold the fork:

class Fork(Semaphore.Semaphore):
    def ConstructFork(cls):
        lfork = cls.ConstructSemaphore(1)
        return lfork
    ConstructFork = classmethod(ConstructFork)

I used a polymodel for Semaphore, so we can override it as above.

Next is an implementation of the faulty algorithm above. ThinkAndEatByKey is a wrapper of ThinkAndEat, which takes Forks by key rather than by object reference, loads them, and calls through. The real work happens in ThinkAndEat.

def ThinkAndEatByKey(aFirstForkKey, aSecondForkKey, aIndex, aNumLoops, aHasFirst, aHasSecond):
    lFirstFork = db.get(aFirstForkKey)
    if not lFirstFork:
        raise Exception("Failed to retrieve Left Fork")
    lSecondFork = db.get(aSecondForkKey)
    if not lSecondFork:
        raise Exception("Failed to retrieve Right Fork")
    ThinkAndEat(lFirstFork, lSecondFork, aIndex, aNumLoops, aHasFirst, aHasSecond)
def ThinkAndEat(aFirstFork, aSecondFork, aIndex, aNumLoops, aHasFirst=False, aHasSecond=False):
    if not aHasFirst:
        # this is before we've got the semaphore"Wait on first for %s" % aIndex)
        aFirstFork.Wait(ThinkAndEatByKey, aFirstFork.key(), aSecondFork.key(), aIndex, aNumLoops, True, False)
    elif not aHasSecond:
        sleep(10) # takes a while to pick up the second fork!"Wait on second for %s" % aIndex)
        aSecondFork.Wait(ThinkAndEatByKey, aFirstFork.key(), aSecondFork.key(), aIndex, aNumLoops, True, True)
    else:"EAT for %s" % aIndex)"Dropping second fork for %s" % aIndex)
        aSecondFork.Signal()"Dropping first fork for %s" % aIndex)
        if aNumLoops == 1:
  "Finished looping, done.")
  "Ready to think again, deferring")

ThinkAndEat has three states. First, if we have neither Semaphore then aHasFirst and aHasSecond are false (I use First and Second instead of Left and Right for a bit of leeway later on). In this case, we wait on the first fork, and aHasFirst, aHasSecond will be True/False on the next call. This is the next case, where we then wait on the the second fork, and aHasFirst, aHasSecond will both be True on the third call. Finally, when they are both true, we have both forks. So we Eat (just log something, but this could be a lengthy op of some kind), then Signal, ie: drop, both forks.

Finally, we reschedule ThinkAndEat again to complete the loop.

You'll note in the second state I've added a sleep for 10 seconds. That is, between picking up the first fork and picking up the second, our philosophers think for a really long time. This doesn't change the theoretical behaviour of the algorithm, but in practice it makes it very easy to deadlock especially on the first iteration; everyone will pick up the first fork, there'll be a pause, then everyone will try to pick up the second fork and have to wait indefinitely.

Note that I use a countdown, aNumLoops, to stop this running forever. Eventually, in my house, even the Philosophers need to finish and go home!

Now, to finish implementing the algorithm above, we need to create all the forks, then call ThinkAndEat for each philosopher, passing in the correct forks.

def DiningPhilosphersFailTest():
    lnumPhilosophers = 5
    lnumLoops = 5 # number of think/eat loops
    leta = datetime.datetime.utcnow() + datetime.timedelta(seconds=20)
    lforks = []
    lforkIndex = 0
    while lforkIndex < lnumPhilosophers:
        lfork = Fork.ConstructFork()
        lforkIndex += 1
    lphilosopherIndex = 0
    while lphilosopherIndex < lnumPhilosophers:
            lforks[(lphilosopherIndex+1) % lnumPhilosophers],
            _eta = leta
        lphilosopherIndex += 1

This method sets up 5 philosophers with 5 forks, who will perform the ThinkAndEat loop 5 times. Philosopher i (i from 0 to 4) gets left fork i, right fork i+1, except for the philosopher 4, who gets left fork 4, right fork 0 (ie: a round table).

When you run this method, it tends to deadlock every time. You can watch the default task list until the queued tasks goes to zero, then go to the datastore and run this query:

select * from Fork where _counter = 0

You should get a bunch of results; each one is a Fork (ie: Semaphore) which has had a successful Wait(), but no comparable Signal(). For this to be the case when no tasks are running requires that a set of philosophers (all of them in fact) have one fork and are waiting for another. Deadlock.

Now to fix this, Dijkstra (who originally posed this problem) proposed a strict ordering of resources (forks) along with a rule that philosophers acquire forks only in resource order. So, if we use the fork numbering above, and say we must acquire lower numbered forks before higher numbered, then we have the same algorithm *except* that philosopher 4 must now change to acquire his right fork first (fork 0) followed by his left fork (fork 4).

We can achieve this simply by swapping forks, passing fork 0 as FirstFork and fork 4 as SecondFork. This is of course why I used First and Second rather than Left and Right.

So here's the non-deadlocking solution:

def DiningPhilosphersSucceedTest():
    lnumPhilosophers = 5
    lnumLoops = 5 # number of think/eat loops
    leta = datetime.datetime.utcnow() + datetime.timedelta(seconds=20)
    lforks = []
    lforkIndex = 0
    while lforkIndex < lnumPhilosophers:
        lfork = Fork.ConstructFork()
        lforkIndex += 1
    lphilosopherIndex = 0
    while lphilosopherIndex < lnumPhilosophers:
        if lphilosopherIndex < lnumPhilosophers-1:
            # not the last one
                _eta = leta
            # the last one
                _eta = leta
        lphilosopherIndex += 1

If you run this you wont get deadlock. You should find that when all tasks complete (the task queue is empty), "select * from Fork where _counter = 0" will give you zero objects.

You can safely run either of these solutions with fewer or much larger numbers of Philosophers, and more or less loops. To get deadlock with the first solution, you'll need to ensure tasks can run together. I find this takes two things; you need a large bucket size and replenishment rate on your default task queue, and you need to give it a decent number of loops, to give AppEngine time to spin up enough instances to run a lot of parallel tasks. To deadlock, you'll need to be able to run all Philosopher tasks concurrently; if they run sequentially they'll resolve out and fail to deadlock. Remember that the deadlocking algorithm isn't guaranteed to deadlock, that's only a possibility. The more philosophers, the harder it'll be to see it in practice (although you should see a lot of contention in the logs).

One more note: You can't see this deadlock with the development appserver. Why not? Because the development server runs tasks sequentially. The Semaphores and Tasks will work, but in a very uninteresting one-task-at-a-time way.

Lifetime issues with Semaphores

You'll notice that all these tests leave Semaphore objects lying around afterwards.

Classically, semaphores are in-memory structures, and require memory management techniques for lifetime management. We have an analogous cleanup requirement for these Datastore backed Semaphores.

You'll need some way to know when you are finished with them, so you can then delete them. For some uses it may be clear (eg: if they are always associated with another resource, then you create them when that resource is created and delete them when that resource is deleted). Other times, you may know that they should only live a certain length of time, so for example you could add a time-to-live or a delete-me-after field to them, and have a background process cleaning up every so often.

Very interesting, but why do I care? 

AppEngine lends itself to all kinds of uses where we are managing access to resources. Either we are managing something limited (with a semaphore with a counter equal to the number of resources available), or we are managing contention over something which can only be touched by one task (with a semaphore with counter=1, otherwise known as a Mutex).

In my app Syyncc, I used a dumb cron loop to move changes from one social network to another. Because changes move in multiple directions, and the central mechanism isn't concurrency safe, I needed to use a single processing loop to ensure I was only working on one thing at a time. It was a brain dead approach to controlling a concurrency issue.

But it doesn't scale. It'll process 50 items at a time, once every 2 minutes. That's severely limited.

Instead, I intend to model each user's combination of connected networks (fb, G+, etc) as a "hub", which is protected by a Mutex (Semaphore with Count=1). I can have separate tasks (Workers) monitoring each network, and driving changes to the others, which don't communicate except that they share the hub Mutex, so only one can run at once. So, if there are changes in a user's G+ and Facebook at the same time, processing those changes will be concurrency safe, with the critical shared pieces protected.

Why not just use transactions?

Because transactions only protect the datastore, and enqueuing of tasks.

Resources under contention can be all kinds of things. In the case of Syyncc, they involve reading and writing the datastore *and* reading and writing social networks. A model of managing contention is better here than transactions (simply because the latter aren't available!)

And, I want to make a State Machine

A project I've got coming up involves starting and stopping virtual machines in AWS. To do that well turns out to be involved, and really I want a full State Machine, where I can define a state graph, with nodes, conditions, transitions, ooh. And to build that, I'll need Semaphores.

Expect to see an article on implementation of a Semaphore-based State Machine in the near future.

Sunday, October 16, 2011

Not Your Mama's Web Server

It's a distributed machine, turkey. 

AppEngine is a funny kind of beast. It's packaged up and presented more or less as a weird kind of futuristic virtual web host; sort of like a normal web host, but the plumbing's taken care of for you. Higher level than normal web hosts, kind of annoyingly proprietary, and it works sort of weirdly when you scratch the surface; the datastore isn't quite like a sql database, the web server's not quite like a proper web server, things are just a little queer.

They're queer because it's not actually a webserver. Or, well, it is a webserver, but that's like calling a Tesla Roadster a horseless carriage. And then complaining that the reins are weird. (And making the SQL service is kind of like putting the reins in.)

What it really is, is a new kind of computer (and by new I mean 40 year old computer science, but that's in no way a criticism). A distributed one. It's made out of lots of little physical ones, but they're mostly abstracted away. What we users get to touch is a high level machine where we can run some code in some well defined and interesting ways, and where that code can talk to some very interesting global resources (memcache, datastore, the webby bits of the internet, etc).

What's really interesting to me is what is always interesting about distributed systems; massive concurrency. *Massive* concurrency. Concurrency on a scale where there aren't really specific limitations on how much stuff we can throw in in parallel; we seem to be limited largely by our wallets. And if you're a bit careful and pay a bit of attention, it's pretty cheap and that limitation is pretty minor. So that's fun.

But this concurrency isn't the within-one-process, multi-threaded computing type of concurrency. This is old school, it feels a lot more like multi-processing, processes running in their own address spaces, maybe on separate physical hardware, at a great distance from one another. By distance here I mean that the inter-process communication mechanisms are many orders of magnitude slower than the code. These are not threads, that behave like people hanging out, chatting with one another. These are more like characters in a Dickens novel, on separate continents, sending letters by sea. Charming and old world, if a little inconvenient. Again, not a criticism, this is how distributed computing is.

(Incidentally, I know that you can do multi-threaded programming on AppEngine, but I just think that's a bit boring and misses the point of the platform. Threads live inside one request, ie: process, and only live as long as that process lives. You can do some optimisation stuff with them, but they are at the wrong level of abstraction to be the real game here.)

So what are the elements of this distributed machine?

Looking at AppEngine as a Distributed Machine

Well, each user account has a bunch of program slots (10, or as many as you like for people with big wallets). Each slot, each "app", is the single-distributed-machine abstraction level. The slot/app/machine has its own resources, and lets you run one program.

Each program has a bunch of potential entry points (web request handlers). They can be used to kick off copies of the program. That can happen because of user requests, or due to scheduled task calls, or cron jobs, or backend jobs (other?). In practice, the program, having multiple entry points, is functionally a set of cooperating programs, more or less divided up by entry point (ie: web request handler).

Each time (a piece of) the program is kicked off, it gets its own address space. (Well, if you mark your programs as "threadsafe", then it might share address space with other copies of itself, but that's again an optimisation that can be ignored here.) But in any case, it's a program, loaded into an address space and being executed. We might call it a "process".

The process can do things. One special purpose thing it might do, in response to a user request, is to construct and return a web page. But it can do all kinds of other things.

Simplest among them is do to some calculation of something. Maybe you want to figure out a metric tonne of prime numbers? That sort of thing. In this case, you just do your calc and finish. Only thing to consider here is that there can be a limitation on the lifespan of a process (ie: 60 seconds, if you're using a frontend).

Anything more complex than simple calculations will need to talk to things outside of the process. Even simple calcs presumably need to go somewhere when they are complete, so you can read the results. So we need interprocess communications mechanisms. What's available?

The most obvious IPC mechanism is the datastore. We can put to the datastore, and get from the datastore. The datastore is persistent, so processes can talk to each other across time and space. The datastore is pretty slow compared to intra-process work (sea mail), although we can ameliorate that a little with memcache.

Interestingly, we probably can't reliably use memcache for this on its own, because it can expire without warning. Actually, hold that thought, there might be interprocess comms jobs to which it is suited, I'll keep an eye out for them.

The channels API could also potentially be used for interprocess communication. It's supposed to be for pushing event information to web clients, but a client could just as easily be a process inside this virtual machine. Then the channels would look like some pretty sweet message queues. So, keep that in mind too.

But the datastore is the bread and butter here.

Now there are roughly two ways of thinking about processes in this machine, two models; roughly, they are Long Lived and Short Lived.

In the Long Lived model, processes live indefinitely, so they need to communicate directly with each other. This is where the channels API could be interesting, acting like one-way pipes between unix processes. Old school message passing. They might also communicate by polling datastore objects, waiting for things to happen. This will require using backends to accomplish, so is probably a bit pricey. It also feels a little clunky; lots of sitting in loops monitoring things.

The Short Lived model is a bit more interesting, mainly because it feels more like the way the platform, the virtual distributed machine, wants to behave. In this model, the processes are short lived, and crucially never talk directly to each other. Instead, they all interact with the datastore. They are kicked off from somewhere (usually as Tasks, I'll get to that), and have arguments which, along with the entry point used, define what they are supposed to do. They read from the datastore to figure out more context, access shared resources, whatever. They perform calculations. They write back to the datastore. And they might queue up more processes (tasks) to run, as appropriate. So the processes behave as individual workers chipping away at the datastore, a shared medium, which is the coordinating point for all their work. Rather than talking directly to each other, the processes lay down cues, and notice cues layed down by others, doing their work based on those cues. It's a bit stigmergic.

Using short lived processes means we can be event driven. Programs can be designed to only be doing work when there's something which needs doing; no processes are running if there is no work to do. Combining lots of event driven pieces together mean that, at a larger scale, we can approximate the process model of traditional computers, performing work when needed and blocking when waiting for results from external sources.

To accomplish this, we first need an implementation of invokable, short lived processes. AppEngine Tasks fit this requirement. If we schedule a task to run now, we can give it arguments/context, it can run when there are resources available, it can do a short job and finish, possibly spawning more tasks. If you squint a little, this is making AppEngine's task scheduler look a little like an operating system's process scheduler.

If we use multiple tasks in concert with one another, we can emulate blocking and waiting. If a task gets to a point where it needs to wait indefinitely on an external event, we can let it quite, and arrange it so that another task can pick up where it left off when the event occurs. So our Task/Process is a bit more limited than a normal machine process, but in aggregate tasks are as powerful as we need them to be.

So in this context, we've got Processes which can do work, and which can potentially block & wait. The next thing we need is some communication primitives to let them coordinate effectively.


Ok, now I make an admission: I still own books. Not for any real information value, mind you, but for sentimental reasons. I don't own many, but I have a little bookshelf in my study, and some of my textbooks from my computer science degree (long ago and far away) sit yellowing, waiting to be called back in to service. This is as good a day as any.

Flicking through them, I've found "Principles of Concurrent and Distributed Programming", by Ben-Ari (edited by no less than C.A.R. Hoare). Flip, flip, ah here we are, Semaphores.

So can we implement a Semaphore appropriate to AppEngine?

If we model the Semaphore as a datastore object, then it needs two methods, Wait and Signal. It also needs an internal integer count (the actual semaphore value), and a queue of waiting processes.

First of all, Wait and Signal need to be atomic. How do we do that in the context of Datastore? With a transaction.

Code inside a normal transaction can work with objects in one entity group. In this case we're inside the limitations, because we only need to work with the object itself in order to implement atomicity of Wait and Signal.

The one difference between a transaction and a traditional critical section (as required by Wait and Signal) is that the transaction can fail. The simplest way to make a transaction into a critical section is to put it inside a loop. Python-like pseudocode:

  while true:
          perform transaction
      catch transaction-fails:
          # do nothing, we'll head around the loop again

  #here we succeeded

Now that could potentially starve. If this is in a front-end task, it'll time out eventually. We'll have to deal with timing out anyway, so let's leave it for that. If this is in a backend, under extreme contention it could just fail repeatedly forever. OTOH if we get into that situation, we're probably doing something spectacularly wrongly.

Now the next difficult thing is that we don't have a way to suspend a task and wake it again, continuing where it left off, when done. But, we could make something a bit crappier that'd do the job.

We could say that if you have a process that wants to wait on a semaphore, then it needs to be broken into two pieces. The first piece works up to where it waits on the semaphore. The second piece is run when it gets through wait, and must signal the semaphore at some point.

So say this is our desired process code:

  semaphore = GetTheSemaphoreFromSomewhere # I'll get to this later

Then that needs to be split to become

  semaphore = GetTheSemaphoreFromSomewhere
  semaphore.wait(EntryPoint2) <- we tell the semaphore where we need to go next

  semaphore = GetTheSemaphoreFromSomewhere

Ok. Then how do we implement the semaphore? Something like this. Pythonish pseudocode again.

class Semaphore(datastore object):
  _counter = int property
  _suspendList = entry point list property

  def Wait(self, NextEntryPoint)
    while true:
          db.run_in_transaction(_doWait, self, NextEntryPoint)
      catch transaction-fails:
          # do nothing

  def Signal(self)
    while true:
          db.run_in_transaction(_doSignal, self)
      catch transaction-fails:
          # do nothing

def _doWait(key, NextEntryPoint)
  self = db.get(key)
  if self._counter > 0:
    self._counter -= 1

def _doSignal(key)
  self = db.get_by_key(key)
  if len(self._suspendList) > 0:
    NextEntryPoint = self._suspendList.remove()
    self._counter += 1

That looks easy, doesn't it?

I haven't explained what entry points are; they could just be the relative url for a request handler. Then, the entry point list property could just be a string list property, and call(NextEntryPoint) could simply do a task.add() for that entry point.

Alternatively, entry points could be pickled function objects, kind of like what's used in deferred.defer. Actually that could be pretty sweet!

Also, where did we get the semaphore? Well, we probably want to be able to create them by name, and retrieve them by name. We could use that name as the entity key, or part of the entity key. Then you create a semaphore somewhere, and later retrieve it by looking it up by key (nice and quick).


oh man I have to stop.

In the next post, I'll include a proper semaphore implementation in python. 

I'd better get it done soon though; I'm having some philosophers to dinner.  

Saturday, October 15, 2011

Multi-threaded Python 2.7 WTFAQ?

No, I said multi-threaded... ah close enough.

I'm just beginning my first experiments with python 2.7 apps, using "threadsafe: true". But I'm a clueless n00b as far as python goes. Well, not a n00b, but still a beginner. And then this multi-threading thing turns up, and I find myself groaning "oh man, really, does it have to get this complex?" I think I hear a lot of similar groans out
there ;-)

I'm betting that the whole "multithreaded" thing in python appengine apps is scaring plenty of people. I've done a lot of concurrent programming, but the prospect of dealing with threading in python has daunted me a bit because I'm a beginner with python and appengine as it is - this just makes life harder. But hey, it's being added for a reason; I'd best quit complaining and start figuring it out!

Thinking about threads and python, I realised that I didn't know how I needed to actually use multi-threading to make my apps leaner and meaner. I mean, why would I use them? They're for doing inherently concurrent things. Serving up pages isn't inherently concurrent stuff, at the app development level. What exactly is expected here? Shouldn't the framework be doing that kind of thing for me?

And of course that was the aha moment. The framework *is* doing the work for me.

The situation with python appengine development up until now has been that instances process serially. They take a request, see it through to its end. They take another request. And so on. That's cool, but instances spend a lot of time sitting around waiting when they could be doing more work.

But with the new python 2.7 support, you can tell appengine that it would be ok to give instances more work when they are blocked waiting for something. eg: if they are doing a big url fetch, or a long query from datastore, something like that, then it's cool to give them another request to begin working on, and come back to the waiting request later when it's ready. You do that by setting "threadsafe: true" in your app.yaml .

Being threadsafe sounds scary! But actually it shouldn't be a huge deal. Pretty much it's about what you shouldn't do, and largely you're probably not doing it anyway.


  • I drove off a cliff and was trapped in my car for the last couple of weeks, surviving on old sauce packets and some pickles that were on the floor. So I'm a bit out of the loop. WTFAQ are you talking about?
  • Threads are when my socks are wearing out and there are dangly bits. Multithreading is when they are really worn out. Right? 
    • Multi-threading means having multiple points of execution on the one codebase in the one address space. You can do some really cool stuff with threads. Or you can safely ignore them.
  • But there is some minimal stuff I should be paying attention to, right?
    • Yup. What you need to know is how to support Concurrent Requests.
  • Concurrent what now?
    • Concurrent Requests means that your instances can serve multiple requests at a time, instead of just one at a time. You'll be paying for those instances. So this should be a bit cheaper.
  • I like money.
    • yes, ok.
  • Ok, so what do I do to get these durn newfangled concurrent whatsits?
    • It's easy. Just follow these steps:
      • You'll be using Python 2.7
      • To use Python 2.7 you have to use the High Replication Datastore. If your app has been around for a while (from before there was a choice of datastore type) then you might be using the Master/Slave datastore. If so, you need to migrate. If you think that's you, then read this:
      • Read this, but don't let it freak you out:
      • Also glance over the new Getting Started sample, it's a bit different.
      • If you got this far and haven't read any of the links above, congratulations. RTFM is for girly men (of all genders). 
      • Figure out if your app is going to be ok:
        • Calls to memcache, datastore, other services of AppEngine, are fine.
        • urlfetch and other httpish stuff (urllib, urllib2?) is fine.
        • Normal code touching local variables is fine.
        • Don't mess with instance memory (unless you know what you're doing). Mostly you can only use it for caching anyway; if you're not already doing that, don't worry about it. Basically, this means staying away from global variables. Multiple requests can come in and fiddle with those globals at the same time, Which Can Be Bad.
        • Libraries included by AppEngine are fine, or else you'll get "don't use this" warnings. So don't worry too much here. But do check this link for changes to libraries with Python 2.7, some of that might be relevant to you.
        • You didn't read that, did you? You are Rock & Roll incarnate.
        • Some of your third party libraries might be messing with global memory, and not be threadsafe. You know that shady date library you scored in a back alley on That might be a problem. Read the code, ask around, or just give it a shot and flag the fact that it might blow up in your face.
      • Rewrite your or equivalent to use WSGI script handlers. That means it should look like this and not like this
      • Set up your App.yaml properly; change "runtime: python" to "runtime: python27" and add "threadsafe: true". Like this:
        application: helloworld
        version: 1
        runtime: python27
        api_version: 1
        threadsafe: true
        - url: /.*
      • Make sure to get the latest appengine sdk; v1.5.5 or later. You can't actually run with threadsafe:true in the dev appserver yet, but you need at least this version or it'll refuse to upload.
  • So I can't run this stuff on the dev appserver?
    • Nope. Just set "threadsafe: false" when running locally. That's a bit annoying, but I'm sure it'll be sorted out soon.
  • Damn, that list of stuff is tl;dr. Do I have to do this?
    • Nope. In fact, it's early days and you'll be heading into experimental land if you do it. If it's totally weirding you out and you have better things to do with your life, just ignore this whole thing for a bit. Eventually, way later on, it'll become properly supported, and then probably compulsory, but by then there'll be better guides, better understanding in the community, all that. It's totally fair to let the crazy nerds race out and crash on the new features, then skate in past the fallen bodies like Steven Bradbury: 

Wednesday, October 12, 2011

Go Spiny Norman, Go!

Update 14 Oct 2011: The green line from my graphs below has just appeared in the AppEngine admin console's Instances graph, as "Billing". Well actually it's the minimum of my green line and the blue "Total Instances" line, ie: it defines what I show as the pink area. Anyway, that's really, really useful, thanks to the AppEngine team from us developers!


tl;dr version: You do *not* have to minimise your app's instance count in order to keep your appengine costs in an acceptable range. What you do have to do is to set Max Idle Instances to a low number. If you do that, your app can cost you about the same as it costs you now, possibly less. You don't need multithreading in order to achieve decent pricing. You will see advice like this: "Forget about the scheduler. Turn on multithreading ASAP". That is wrong.


It looks like the Spiny Norman Test was successful.

Recall the hypothesis:

Hypothesis: Ignoring the 15 minute cost for spinning up new instances, the price we pay should be the pink area on the graph. That is, the moment by moment minimum of (total instances) and (active instances + Max Idle Instances). If Max Idle Instances is Automatic, then there is no green line, and we pay for the area under the blue line.

I proposed 

1 - First test that we pay for the area under the blue line when Max Idle Instances is Automatic.
2 - Next, test that we pay for the pink area when Max Idle Instances is set to something.

I ran two tests. Firstly, I ran Spiny Norman with Max Idle Instances set to Automatic. Secondly, I ran Spiny Norman with Max Idle Instances set to 2.

Test 1: Max Idle Instances set to Automatic

Here's the instance graph for the day:

The area under the blue line looks to be roughly 7 * 20 = 140 hours. The billing says:

Frontend Instance Hours

Looks about right! So if you leave Max Idle Instances set to Automatic, you'll pay for the entire blue area.

Test 2: Max Idle Instances set to 2

Here's the instance graph for the day of this test:

That, believe it or not, is an identical run of Spiny Norman. Why the huge blowout of instances in the second part of the day? No idea, maybe the Max Idle Instances = 2 setting caused the scheduler to get upset? In any case, here we care about the yellow line, active instances, not the blue line. Here's a modified graph with the predicted "pink area" as above, for yellow line + 2:

Spiny Norman uses negligible instance time (very low Active line), he just causes total (idle) instances to blow out. So it looks like that area is around, say, 0+2 instances * 20 hours? So approx 40 hours.

And the billing says:

Frontend Instance Hours

That's about right!


I could do some more in depth tests, changing the behaviour of Spiny Norman over the course of a day, playing with the Max Idle Instances setting, but I think these two tests show the state of play pretty adequately. Hypothesis supported.

So what this means, especially for those of us watching our pennies, is that even though we can't stop the scheduler kicking off massive amounts of instances, we can control whether we pay for them or not. Make sure that you set Max Idle Instances to a fixed (low!) number. For my self funded projects I'll be setting it to 1, and that'll do.

Leaving Max Idle Instances on Automatic, the default, is a mistake you'll regret very, very quickly.

Of course the billing rules will probably change tomorrow. Ah well.

Monday, October 10, 2011

An Embarrassment of Riches

Update: Details on new stuff in Python 2.7 for appengine here:

AppEngine is all colour and movement at the moment.

  • MySQL compatible db layer, Google Cloud SQL. 
  • Python 2.7 now available as an experimental runtime for all apps. 
  • Cross group (XG) transactions. 
  • Increased limits for all kinds of cool stuff.

First, Cloud SQL is released. Yo Dawg, we heard you like databases, so we put a database in your database so you can query while you query. Or, more clearly, this:

Thursday, October 6, 2011

Google Cloud SQL: Your database in the cloud

Cross-posted from the Google Code Blog

One of App Engine’s most requested features has been a simple way to develop traditional database-driven applications. In response to your feedback, we’re happy to announce the limited preview of Google Cloud SQL. You can now choose to power your App Engine applications with a familiar relational database in a fully-managed cloud environment. This allows you to focus on developing your applications and services, free from the chores of managing, maintaining and administering relational databases. Google Cloud SQL brings many benefits to the App Engine community:
  • No maintenance or administration - we manage the database for you.
  • High reliability and availability - your data is replicated synchronously to multiple data centers. Machine, rack and data center failures are handled automatically to minimize end-user impact.
  • Familiar MySQL database environment with JDBC support (for Java-based App Engine applications) and DB-API support (for Python-based App Engine applications).
  • Comprehensive user interface for administering databases.
  • Simple and powerful integration with Google App Engine.
The service includes database import and export functionality, so you can move your existing MySQL databases to the cloud and use them with App Engine. Cloud SQL is available free of charge for now, and we will publish pricing at least 30 days before charging for it. The service will continue to evolve as we work out the kinks during the preview, but let us know if you’d like to take it for a spin. 

Here's the FAQ:
And the Group:!forum/google-cloud-sql-discuss 

The signup form implied that we'll be able to mix and match SQL and datastore, which is damned fine.

Then, the prerelease of the new SDK is announced by Ikai Lan:

Hey everyone,

Prerelease SDK 1.5.5 is now available for download! You can get it here:



We provide prerelease SDKs as previews for things to come. New features should not work in production yet, and documentation is typically still a work in progress. Release notes are below as well as in the prerelease packages:

- Python 2.7 is now available as an experimental runtime for all applications
  using the High Replication Datastore. To upload your app to the Python 2.7
  runtime, change the runtime argument in your app.yaml to python27.
- We have released an experimental utility, available in the Admin Console, to
  assist in migrating your application to the High Replication datastore. This
  utility allows you to copy the bulk of your data in the background, while the
  source application is still serving. You then need a brief read-only period to
  migrate your application data while you copy the data that has changed from
  the time the original copy started.
- We have increased the number of files you can upload with your application
  from 3,000 to 10,000.
- We have increased the size limit for a single file uploaded to App Engine from
  10MB to 32MB.
- We have increased the Frontend request deadline from 30 seconds to 60 seconds.
- We have increased the URLFetch maximum deadline from 10 seconds to 60 seconds.
- We have increased the URLFetch Post payload from 1MB to 5MB.
- App Engine now supports Cross Group (XG) transactions with the High
  Replication Datastore, which allow you to perform transactions across
  multiple entity groups.
- We have released an experimental API that can write to Google Storage for
  Developers directly from App Engine.
- We have added a graph to the admin console that displays the number of
  instances for which you will be billed.
- In the XMPP API, get_presence() is deprecated in favor of using the inbound
  presence handlers documented in
- The Task Queue API 'target' parameter now accepts a new value,
  taskqueue.DEFAULT_APP_VERSION, which will send the task to the default
  frontend version, rather than the version or backend where the 'add' method is
  being called.
- In the URLFetch API, make_fetch_call() now returns an RPC object.
- Fixed an issue in the Admin Console where the "Run Now" button did not work
  for tasks with a '-' in the name.
- Fixed an issue where the SDK did not decode Base64 encoded blobs.
- Fixed an issue to provide a better error message when using the Mail API to
  send email to an invalid user address.
- Fixed an issue in the SDK where a skip_files entry caused an ImportError when
  the library was located elsewhere in the PYTHONPATH.
- Fixed an issue in the SDK index viewer where the arrows indicating whether a
  query was ascending or descending were not properly rendered.
- Fixed an issue where httplib did not support the deadline argument for
  URLFetch calls.
- Fixed an issue where you could not schedule a cron job to run every 100
- Fixed an issue in the SDK where failed tasks retried immediately instead of
  waiting for 30 seconds.
- Fixed an issue making it possible to modify request headers using the deferred

- We have released an experimental utility, available in the Admin Console, to
  assist in migrating your application to the High Replication datastore. This
  utility allows you to copy the bulk of your data in the background, while the
  source application is still serving. You then need to take a short downtime to
  migrate your application data while you copy the data that has changed from
  the time the original copy started.
- We have increased the number of files you can upload with your application to
  from 3,000 to 10,000.
- We have increased the size limit for a single file uploaded to App Engine from
  10MB to 32MB.
- We have increased the Frontend request deadline from 30 seconds to 60 seconds.
- We have increased the URLFetch maximum deadline from 10 seconds to 60 seconds.
- We have increased the URLFetch Post payload from 1MB to 5MB.
- App Engine now supports Cross Group (XG) transactions with the High
  Replication Datastore, which allow you to perform transactions across multiple
  entity groups.
- We have released an experimental API that can write to Google Storage for
  Developers directly from App Engine.
- We have added a graph to the admin console that displays the number of
  instances for which you will be billed.
- In the XMPP API, getPresence() is deprecated in favor of using the inbound
  presence handlers documented in
- Fixed an issue in the Admin Console where the "Run Now" button did not work
  for tasks with a '-' in the name.
- Fixed an issue to provide a better error message when a user tries to parse an
  HttpRequest's input stream more than once in a request.
- Fixed an issue to provide a better error message when using the Mail API to
  send email to an invalid user address.
- Fixed an issue in the SDK where HttpServletRequest.getInputStream().read()
  always returned -1.
- Fixed an issue where you could not schedule a cron job to run every 100

Ikai Lan
Developer Programs Engineer, Google App Engine

The Spiny Norman Test

Update: Results are in, see Go Spiny Norman, Go.

Previously, in The Amazing Story of AppEngine and the Two Orders Of Magnitude, I've written about minimizing the cost of instances in the new AppEngine billing regime. But I think I made a mistake, and I think many people are making the same mistake.

Here's one of the graphs that I showed of instance usage from my appengine app Syyncc:

My posts were largely about trying to drop the blue line down (that's "Total" instances), and I largely ignored the yellow line, "Active" instances.

Now to get that blue line down, I did two things. I first set Max Idle Instances to 1, from Automatic. That is detailed here, and was successful in dropping the blue line down. Next, I changed my app's task behaviour, from kicking off 50 tasks every 2 mins, to smoothing those out, scheduling one every two seconds.

Once I got my billing results, these changes made a huge impact.  But, the numbers were puzzling. Firstly, they were too low (which I just accepted happily, as these numbers represent money in my pocket). Secondly, it appeared that all the benefit was seen based on the first change (Max Idle Instances), with no change from the smoothing out of tasks. That's been bugging me.

And then on the AppEngine list, Gerald Tan made this comment:

The reason why your Frontend Instance hours are lower than you expected is because you assumed that you will be billed for the area under the BLUE line in the Instance graph. It's not. You are being billed for the area under the YELLOW line (Active Instance) PLUS your Max Idle Instance setting. So your Active Instances is hovering at around ~0.72, and I assume you have set your application's Max Idle Instance to 1. Therefore ~1.72 * 24 = ~41.28 Instance Hours

Oh really?? That would match the data, very cool. And why were the numbers so high before I set the Max Idle Instances to 1?

This Post-Preview Pricing FAQ (should have been called a Primer for the alliteration) says some unclear things. We have this:

"Instances are charged for their uptime in addition to a 15-minute startup fee, the startup fee covers what it takes for App Engine to bring up and down the instance. So, if you have an on-demand instance only serving traffic for 5 minutes, you will pay for 5+15 minutes, or $0.08 / 60 * 20 = 2.6 cents. Additionally, if the instance stops and then starts again within a 15 minute window, the startup fee will only be charged once and the instance will be considered "up" for the time that passed. For example, if an on-demand instance is serving traffic for 5 min, is then down for 4 minutes and then serving traffic for 3 more minutes, you will pay for (5+4+3)+15 minutes, or $0.08 / 60 * 27 = 3.6 cents."

On the other hand, this:

Max Idle Instances: Decreasing this value will likely decrease your bill as fewer idle instances will typically be running and we will not charge for any excessive idle instances. In this case the scheduler knob is a suggestion to the scheduler but we will not charge you for excess if the scheduler ignores the suggestion. For instance, if you set Max Idle Instances to 5 and the scheduler leaves 16 instances up for some length of time, you will only be charged for 5 instances.

So, I think this might mean the following:

If you set "Max Idle Instances" to Automatic (the default setting), that means you are letting the scheduler spend your money. It'll keep as many instances running at any time as it thinks you need, and you'll pay for all of them (plus that nasty 15 minute bonus on starting up extras). This means, you pay for the area under the blue line.

If you set "Max Idle Instances" to a specific value, you'll pay for your active instance time plus your "Max Idle Instances" setting, or your Total instance time, whichever is less. ie: you pay for the minimum of (area under yellow line + Max Idle Instances) and (area under blue line).

So setting Max Idle Instances to an actual number is a good idea. The lower you set it, the more it might affect the scheduler's decisions, but still, to minimise cost, set it to a finite number.

Great conjectures. But then, the old lady in my head (oh god she's really in there) says this:


Ok old lady, I'll test it. razza frazza rackkin testin frazza razza....


Ok, so first we need an hypothesis. Put a new line on the graph, a green line, which is the yellow line, raised up by the setting of Max Idle Instances. If Max Idle Instances is 3, it'll look like this:

The pink area is the intersection of the area under the blue line and the area under the green line.

Hypothesis: Ignoring the 15 minute cost for spinning up new instances, the price we pay should be the pink area on the graph. That is, the moment by moment minimum of (total instances) and (active instances + Max Idle Instances). If Max Idle Instances is Automatic, then there is no green line, and we pay for the area under the blue line.

So how do we test that hypothesis?

1 - First test that we pay for the area under the blue line when Max Idle Instances is Automatic.
2 - Next, test that we pay for the pink area when Max Idle Instances is set to something.

To get a good test here, we want to create an instance usage profile where the blue line and the yellow line are disparate. My best guess for how to do this is to create some spiky usage, that should leave too many instances running most of the time.

Enter Spiny Norman!

Spiny Norman is a Worker class, designed to do one thing; cause AppEngine to experience very bursty load.

import logging

from Worker import Worker
from datetime import timedelta
from google.appengine.ext import db

class SpinyNorman(Worker):
    _minutesBetweenSpines = 12
    _spineWidth = 1000000
    _numberOfSpinesRemaining = db.IntegerProperty()
    def CreateSpines(cls, aSpineLength, aNumberOfSpines):
        lcount = 0
        while lcount < aSpineLength:
            lnorman = SpinyNorman()
            lnorman._numberOfSpinesRemaining = aNumberOfSpines
            lnorman.enabled = True
            lcount += 1
    CreateSpines = classmethod(CreateSpines)

    def doExecute(self):
        self._numberOfSpinesRemaining -= 1
        lcount = 0
        while lcount < self._spineWidth:
            lcount += 1
    def doCalculateNextRun(self, aUtcNow, alastDue):
        if self._numberOfSpinesRemaining > 0:
            if alastDue:
                return alastDue + timedelta(minutes=self._minutesBetweenSpines)
                return aUtcNow + timedelta(minutes=self._minutesBetweenSpines)
            return None # time to stop
Spiny Norman creates a spiny workload, as follows:

Each "Spine" is a set of tasks running (doExecute()) at the same time. The length of the spine is the number of tasks. The width of the spine (in time) is a measure of how much work the spine will do (how long it'll work for). Spines are set apart from each other in time, which is the minutes between spines. There are a fixed number of spines.

You kick off Spiny Norman by calling SpinyNorman.CreateSpines(spineLength, numberOfSpines) . That creates a number of instances of Spiny Norman equal to the spineLength, and sets the countdown for how many iterations they should continue for (numberOfSpines). _spineWidth is the number of times to sit in a busy loop in doExecute.  _minutesBetweenSpines is used to calculate the next run time in doCalculateNextRun.

I'm using a spine length of 250 (that is, 250 tasks), a spine width of 1,000,000 (enough load to notice some work being done, a few second's worth), 12 minutes between spines and 100 spines total (ie Spiny Norman runs for about 1200 minutes, or 20 hours, total).

I've set up a new AppEngine instance, I've enabled billing, and I've kicked off Spiny Norman to run during his own billing day. I've left Max Idle Instances set to Automatic. We should see a huge difference between the blue and yellow instance lines, and the billing should tell us which one I'm paying for, which will test part 1 of the hypothesis.

In the next post I'll publish the result of this test, and I'll kick off the next test. Stay tuned!