Sunday, October 16, 2011

Not Your Mama's Web Server

It's a distributed machine, turkey. 

AppEngine is a funny kind of beast. It's packaged up and presented more or less as a weird kind of futuristic virtual web host; sort of like a normal web host, but the plumbing's taken care of for you. Higher level than normal web hosts, kind of annoyingly proprietary, and it works sort of weirdly when you scratch the surface; the datastore isn't quite like a sql database, the web server's not quite like a proper web server, things are just a little queer.

They're queer because it's not actually a webserver. Or, well, it is a webserver, but that's like calling a Tesla Roadster a horseless carriage. And then complaining that the reins are weird. (And making the SQL service is kind of like putting the reins in.)

What it really is, is a new kind of computer (and by new I mean 40 year old computer science, but that's in no way a criticism). A distributed one. It's made out of lots of little physical ones, but they're mostly abstracted away. What we users get to touch is a high level machine where we can run some code in some well defined and interesting ways, and where that code can talk to some very interesting global resources (memcache, datastore, the webby bits of the internet, etc).

What's really interesting to me is what is always interesting about distributed systems; massive concurrency. *Massive* concurrency. Concurrency on a scale where there aren't really specific limitations on how much stuff we can throw in in parallel; we seem to be limited largely by our wallets. And if you're a bit careful and pay a bit of attention, it's pretty cheap and that limitation is pretty minor. So that's fun.

But this concurrency isn't the within-one-process, multi-threaded computing type of concurrency. This is old school, it feels a lot more like multi-processing, processes running in their own address spaces, maybe on separate physical hardware, at a great distance from one another. By distance here I mean that the inter-process communication mechanisms are many orders of magnitude slower than the code. These are not threads, that behave like people hanging out, chatting with one another. These are more like characters in a Dickens novel, on separate continents, sending letters by sea. Charming and old world, if a little inconvenient. Again, not a criticism, this is how distributed computing is.

(Incidentally, I know that you can do multi-threaded programming on AppEngine, but I just think that's a bit boring and misses the point of the platform. Threads live inside one request, ie: process, and only live as long as that process lives. You can do some optimisation stuff with them, but they are at the wrong level of abstraction to be the real game here.)

So what are the elements of this distributed machine?

Looking at AppEngine as a Distributed Machine

Well, each user account has a bunch of program slots (10, or as many as you like for people with big wallets). Each slot, each "app", is the single-distributed-machine abstraction level. The slot/app/machine has its own resources, and lets you run one program.

Each program has a bunch of potential entry points (web request handlers). They can be used to kick off copies of the program. That can happen because of user requests, or due to scheduled task calls, or cron jobs, or backend jobs (other?). In practice, the program, having multiple entry points, is functionally a set of cooperating programs, more or less divided up by entry point (ie: web request handler).

Each time (a piece of) the program is kicked off, it gets its own address space. (Well, if you mark your programs as "threadsafe", then it might share address space with other copies of itself, but that's again an optimisation that can be ignored here.) But in any case, it's a program, loaded into an address space and being executed. We might call it a "process".

The process can do things. One special purpose thing it might do, in response to a user request, is to construct and return a web page. But it can do all kinds of other things.

Simplest among them is do to some calculation of something. Maybe you want to figure out a metric tonne of prime numbers? That sort of thing. In this case, you just do your calc and finish. Only thing to consider here is that there can be a limitation on the lifespan of a process (ie: 60 seconds, if you're using a frontend).

Anything more complex than simple calculations will need to talk to things outside of the process. Even simple calcs presumably need to go somewhere when they are complete, so you can read the results. So we need interprocess communications mechanisms. What's available?

The most obvious IPC mechanism is the datastore. We can put to the datastore, and get from the datastore. The datastore is persistent, so processes can talk to each other across time and space. The datastore is pretty slow compared to intra-process work (sea mail), although we can ameliorate that a little with memcache.

Interestingly, we probably can't reliably use memcache for this on its own, because it can expire without warning. Actually, hold that thought, there might be interprocess comms jobs to which it is suited, I'll keep an eye out for them.

The channels API could also potentially be used for interprocess communication. It's supposed to be for pushing event information to web clients, but a client could just as easily be a process inside this virtual machine. Then the channels would look like some pretty sweet message queues. So, keep that in mind too.

But the datastore is the bread and butter here.

Now there are roughly two ways of thinking about processes in this machine, two models; roughly, they are Long Lived and Short Lived.

In the Long Lived model, processes live indefinitely, so they need to communicate directly with each other. This is where the channels API could be interesting, acting like one-way pipes between unix processes. Old school message passing. They might also communicate by polling datastore objects, waiting for things to happen. This will require using backends to accomplish, so is probably a bit pricey. It also feels a little clunky; lots of sitting in loops monitoring things.

The Short Lived model is a bit more interesting, mainly because it feels more like the way the platform, the virtual distributed machine, wants to behave. In this model, the processes are short lived, and crucially never talk directly to each other. Instead, they all interact with the datastore. They are kicked off from somewhere (usually as Tasks, I'll get to that), and have arguments which, along with the entry point used, define what they are supposed to do. They read from the datastore to figure out more context, access shared resources, whatever. They perform calculations. They write back to the datastore. And they might queue up more processes (tasks) to run, as appropriate. So the processes behave as individual workers chipping away at the datastore, a shared medium, which is the coordinating point for all their work. Rather than talking directly to each other, the processes lay down cues, and notice cues layed down by others, doing their work based on those cues. It's a bit stigmergic.

Using short lived processes means we can be event driven. Programs can be designed to only be doing work when there's something which needs doing; no processes are running if there is no work to do. Combining lots of event driven pieces together mean that, at a larger scale, we can approximate the process model of traditional computers, performing work when needed and blocking when waiting for results from external sources.

To accomplish this, we first need an implementation of invokable, short lived processes. AppEngine Tasks fit this requirement. If we schedule a task to run now, we can give it arguments/context, it can run when there are resources available, it can do a short job and finish, possibly spawning more tasks. If you squint a little, this is making AppEngine's task scheduler look a little like an operating system's process scheduler.

If we use multiple tasks in concert with one another, we can emulate blocking and waiting. If a task gets to a point where it needs to wait indefinitely on an external event, we can let it quite, and arrange it so that another task can pick up where it left off when the event occurs. So our Task/Process is a bit more limited than a normal machine process, but in aggregate tasks are as powerful as we need them to be.

So in this context, we've got Processes which can do work, and which can potentially block & wait. The next thing we need is some communication primitives to let them coordinate effectively.


Ok, now I make an admission: I still own books. Not for any real information value, mind you, but for sentimental reasons. I don't own many, but I have a little bookshelf in my study, and some of my textbooks from my computer science degree (long ago and far away) sit yellowing, waiting to be called back in to service. This is as good a day as any.

Flicking through them, I've found "Principles of Concurrent and Distributed Programming", by Ben-Ari (edited by no less than C.A.R. Hoare). Flip, flip, ah here we are, Semaphores.

So can we implement a Semaphore appropriate to AppEngine?

If we model the Semaphore as a datastore object, then it needs two methods, Wait and Signal. It also needs an internal integer count (the actual semaphore value), and a queue of waiting processes.

First of all, Wait and Signal need to be atomic. How do we do that in the context of Datastore? With a transaction.

Code inside a normal transaction can work with objects in one entity group. In this case we're inside the limitations, because we only need to work with the object itself in order to implement atomicity of Wait and Signal.

The one difference between a transaction and a traditional critical section (as required by Wait and Signal) is that the transaction can fail. The simplest way to make a transaction into a critical section is to put it inside a loop. Python-like pseudocode:

  while true:
          perform transaction
      catch transaction-fails:
          # do nothing, we'll head around the loop again

  #here we succeeded

Now that could potentially starve. If this is in a front-end task, it'll time out eventually. We'll have to deal with timing out anyway, so let's leave it for that. If this is in a backend, under extreme contention it could just fail repeatedly forever. OTOH if we get into that situation, we're probably doing something spectacularly wrongly.

Now the next difficult thing is that we don't have a way to suspend a task and wake it again, continuing where it left off, when done. But, we could make something a bit crappier that'd do the job.

We could say that if you have a process that wants to wait on a semaphore, then it needs to be broken into two pieces. The first piece works up to where it waits on the semaphore. The second piece is run when it gets through wait, and must signal the semaphore at some point.

So say this is our desired process code:

  semaphore = GetTheSemaphoreFromSomewhere # I'll get to this later

Then that needs to be split to become

  semaphore = GetTheSemaphoreFromSomewhere
  semaphore.wait(EntryPoint2) <- we tell the semaphore where we need to go next

  semaphore = GetTheSemaphoreFromSomewhere

Ok. Then how do we implement the semaphore? Something like this. Pythonish pseudocode again.

class Semaphore(datastore object):
  _counter = int property
  _suspendList = entry point list property

  def Wait(self, NextEntryPoint)
    while true:
          db.run_in_transaction(_doWait, self, NextEntryPoint)
      catch transaction-fails:
          # do nothing

  def Signal(self)
    while true:
          db.run_in_transaction(_doSignal, self)
      catch transaction-fails:
          # do nothing

def _doWait(key, NextEntryPoint)
  self = db.get(key)
  if self._counter > 0:
    self._counter -= 1

def _doSignal(key)
  self = db.get_by_key(key)
  if len(self._suspendList) > 0:
    NextEntryPoint = self._suspendList.remove()
    self._counter += 1

That looks easy, doesn't it?

I haven't explained what entry points are; they could just be the relative url for a request handler. Then, the entry point list property could just be a string list property, and call(NextEntryPoint) could simply do a task.add() for that entry point.

Alternatively, entry points could be pickled function objects, kind of like what's used in deferred.defer. Actually that could be pretty sweet!

Also, where did we get the semaphore? Well, we probably want to be able to create them by name, and retrieve them by name. We could use that name as the entity key, or part of the entity key. Then you create a semaphore somewhere, and later retrieve it by looking it up by key (nice and quick).


oh man I have to stop.

In the next post, I'll include a proper semaphore implementation in python. 

I'd better get it done soon though; I'm having some philosophers to dinner.  

No comments:

Post a Comment