Sunday, January 6, 2013

gaedocstore: JSON Document Database Layer for ndb


In my professional life I'm working on a server side appengine based system whose next iteration needs to be really good at dealing with schema-less data; JSON objects, in practical terms. To that end I've thrown together a simple document database layer to sit on top of appengine's ndb, in python.

Here's the github repo: https://github.com/emlynoregan/gaedocstore

And here's the doco as it currently exists in the repo; it should explain what I'm up to.

This library will no doubt change as it begins to be used in earnest.


gaedocstore

gaedocstore is MIT licensed http://opensource.org/licenses/MIT
gaedocstore is a lightweight document database implementation that sits on top of ndb in google appengine.

Introduction

If you are using appengine for your platform, but you need to store arbitrary (data defined) entities, rather than pre-defined schema based entities, then gaedocstore can help.
gaedocstore takes arbitrary JSON object structures, and stores them to a single ndb datastore object called GDSDocument.
In ndb, JSON can simply be stored in a JSON property. Unfortunately that is a blob, and so unindexed. This library stores the bulk of the document in first class expando properties, which are indexed, and only resorts to JSON blobs where it can't be helped (and where you are unlikely to want to search anyway).
gaedocstore also provides a method for denormalised linking of objects; that is, inserting one document into another based on a reference key, and keeping the inserted, denormalised copy up to date as the source document changes. Amongst other uses, this allows you to provide performant REST apis in which objects are decorated with related information, without the penalty of secondary lookups.

Simple Put

When JSON is stored to the document store, it is converted to a GDSDocument object (an Expando model subclass) as follows:
  • Say we are storing an object called Input.
  • Input must be a dictionary.
  • Input must include a key at minimum. If no key is provided, the put is rejected.
    • If the key already exists for a GDSDocument, then that object is updated using the new JSON.
    • With an update, you can indicate "Replace" or "Update" (default is Replace). Replace entirely replaces the existing entity. "Update" merges the entity with the existing stored entity, preferentially including information from the new JSON.
    • If the key doesn't already exist, then a new GDSDocument is created for that key.
  • The top level dict is mapped to the GDSDocument (which is an expando).
  • The GDSDocument property structure is built recursively to match the JSON object structure.
    • Simple values become simple property values
    • Arrays of simple values become a repeated GenericProperty. ie: you can search on the contents.
    • Arrays which include dicts or arrays become JSON in a GDSJson object, which just holds "json", a JsonProperty (nothing inside is indexed, or searchable)
    • Dictionaries become another GDSDocument
    • So nested dictionary fields are fully indexed and searchable, including where their values are lists of simple types, but anything inside a complex array is not.
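The mapping rules above can be sketched in plain Python. This is only an illustration of the decision made for each value, not gaedocstore's actual code: classify_value and describe_storage are hypothetical names, the returned labels are just descriptive strings, and nothing here touches ndb.

```python
def classify_value(value):
    """Return a label for how a JSON value would be stored under the rules above."""
    if isinstance(value, dict):
        return "nested GDSDocument"
    if isinstance(value, list):
        # Any dict or list inside the array makes the whole array a JSON blob
        if any(isinstance(item, (dict, list)) for item in value):
            return "GDSJson blob (unindexed)"
        return "repeated GenericProperty (indexed)"
    return "simple property (indexed)"


def describe_storage(document):
    """Map each top level field name to the storage label for its value."""
    if not isinstance(document, dict):
        raise ValueError("Input must be a dictionary")
    if "key" not in document:
        raise ValueError("Input must include a key")
    return {name: classify_value(value) for name, value in document.items()}


example = {
    "key": "897654",
    "name": "Fred",
    "tags": ["some", "tags"],
    "history": [{"year": 2012}],
    "address": {"city": "stuffville"},
}
layout = describe_storage(example)
```

Here only "history" ends up as an unindexed blob, because it is the one array that contains a dictionary.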
eg, to store a new Person with a nested address:
ldictPerson = {
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "addr1": "1 thing st",
        "city": "stuffville",
        "zipcode": 54321,
        "tags": ['some', 'tags']
    }
}

lperson = GDSDocument.ConstructFromDict(ldictPerson)
lperson.put()
This will create a new person. If a GDSDocument with key "897654" already existed then this will overwrite it. If you'd like to instead merge over the top of an existing GDSDocument, you can use aReplace = False, eg:
lperson = GDSDocument.ConstructFromDict(ldictPerson, aReplace = False)
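To make the difference between Replace and Update concrete, here is a sketch on plain dictionaries. It assumes the merge recurses into nested dictionaries, which the description above doesn't spell out, and merge_documents and apply_put are hypothetical names rather than part of gaedocstore.

```python
import copy


def merge_documents(existing, incoming):
    """Merge incoming over existing, preferring values from incoming."""
    merged = copy.deepcopy(existing)
    for name, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(name), dict):
            # Assumption: nested dictionaries are merged field by field
            merged[name] = merge_documents(merged[name], value)
        else:
            merged[name] = copy.deepcopy(value)
    return merged


def apply_put(existing, incoming, replace=True):
    """Return what would be stored for a put in Replace or Update mode."""
    if existing is None or replace:
        return copy.deepcopy(incoming)
    return merge_documents(existing, incoming)


stored = {
    "key": "897654",
    "name": "Fred",
    "address": {"city": "stuffville", "zipcode": 54321},
}
update = {"key": "897654", "address": {"city": "somewheretown"}}

replaced = apply_put(stored, update, replace=True)
merged = apply_put(stored, update, replace=False)
```

In this sketch, replaced loses "name" and "zipcode" because only the new JSON survives, while merged keeps them and picks up the new city.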

Simple Get

All GDSDocument objects have a top level key. The normal ndb key lookup (eg: calling get() on an ndb.Key) is used to fetch objects by their key.

Querying

Normal ndb querying can be used on the GDSDocument entities. It is recommended that different types of data (eg Person, Address) are denoted using a top level attribute "type". This is only a recommended convention however, and is in no way required.
You can query on properties in the GDSDocument, ie: properties from the original JSON.
Querying based on properties in nested dictionaries is fully supported.
eg: Say I store the following JSON:
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "key": "1234567",
        "type": "Address",
        "addr1": "1 thing st",
        "city": "stuffville",
        "zipcode": 54321
    }
}
A query that would return potentially multiple objects including this one is:
GDSDocument.gql("WHERE address.zipcode = 54321").fetch()
or
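from google.appengine.ext.ndb import GenericProperty

# Set _name directly, since passing a dotted name to the constructor fails
s = GenericProperty()
s._name = 'address.zipcode'
GDSDocument.query(s == 54321).fetch()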
Note that if you are querying on properties below the top level, you cannot do the more standard
GDSDocument.query(GenericProperty('address.zipcode') == 54321).fetch()  # fails
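If it helps to see what a dotted property name like address.zipcode means, here is a sketch that resolves one against plain dictionaries. It doesn't use the datastore at all; get_path and matches are hypothetical helpers, and the list handling mirrors the way an equality filter on a repeated property matches when any one element is equal.

```python
_MISSING = object()  # sentinel so a stored None is not confused with "absent"


def get_path(document, dotted_name):
    """Walk nested dictionaries following a dotted name such as 'address.zipcode'."""
    current = document
    for part in dotted_name.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches(document, dotted_name, expected):
    """Return True if the value at dotted_name equals (or contains) expected."""
    value = get_path(document, dotted_name)
    if value is _MISSING:
        return False
    if isinstance(value, list):
        return expected in value
    return value == expected


person = {
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": {"city": "stuffville", "zipcode": 54321, "tags": ["some", "tags"]},
}
found = [doc["key"] for doc in [person] if matches(doc, "address.zipcode", 54321)]
```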
If you need to get the json back from a GDSDocument, just do this:
ldict = lgdsDocument.to_dict()

Denormalized Object Linking

gaedocstore directly supports denormalized object linking.
Say you have two entities, an Address:
{
    "key": "1234567",
    "type": "Address",
    "addr1": "1 thing st",
    "city": "stuffville",
    "zipcode": 54321
}
and a Person:
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": // put the address with key "1234567" here
}
You'd like to store the Person so the correct linked address is there; not just the key, but the values (type, addr1, city, zipcode).
If you store the Person as:
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": {"key": "1234567"}
}
then this will automatically be expanded to
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "key": "1234567",
        "type": "Address",
        "addr1": "1 thing st",
        "city": "stuffville",
        "zipcode": 54321
    }
}
Furthermore, gaedocstore will update these values if you change address. So if address changes to:
{
    "key": "1234567",
    "type": "Address",
    "addr1": "2 thing st",
    "city": "somewheretown",
    "zipcode": 12345
}
then the person will automatically update to
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "key": "1234567",
        "type": "Address",
        "addr1": "2 thing st",
        "city": "somewheretown",
        "zipcode": 12345
    }
}
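The expand-and-refresh behaviour just described can be sketched with an in-memory stand-in for the datastore. This is a rough illustration under some assumptions, not how gaedocstore is implemented: LinkingStore is a hypothetical class, only top level fields are treated as links, and any nested dictionary whose "key" matches a saved document counts as a link.

```python
import copy


class LinkingStore(object):
    """In-memory stand-in for the datastore, keyed by each document's "key"."""

    def __init__(self):
        self.sources = {}  # documents exactly as they were supplied
        self.stored = {}   # documents with linked objects copied inline

    def _expand(self, document):
        expanded = copy.deepcopy(document)
        for name, value in document.items():
            # A nested dict whose key matches a known document is a link
            if isinstance(value, dict) and value.get("key") in self.sources:
                expanded[name] = copy.deepcopy(self.sources[value["key"]])
        return expanded

    def put(self, document):
        key = document["key"]
        self.sources[key] = copy.deepcopy(document)
        self.stored[key] = self._expand(document)
        # Refresh every other document that embeds the one just saved
        for other_key, other in self.sources.items():
            links = [v for v in other.values() if isinstance(v, dict)]
            if other_key != key and any(v.get("key") == key for v in links):
                self.stored[other_key] = self._expand(other)


store = LinkingStore()
store.put({"key": "1234567", "type": "Address", "addr1": "1 thing st",
           "city": "stuffville", "zipcode": 54321})
store.put({"key": "897654", "type": "Person", "name": "Fred",
           "address": {"key": "1234567"}})
city_before = store.stored["897654"]["address"]["city"]

store.put({"key": "1234567", "type": "Address", "addr1": "2 thing st",
           "city": "somewheretown", "zipcode": 12345})
city_after = store.stored["897654"]["address"]["city"]
```

The second put of the address is what triggers the refresh of Fred's inline copy; the Person itself is never saved again.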
Denormalized Object Linking also supports pybOTL transform templates. gaedocstore can take a list of "name", "transform" pairs. When a key appears like
{
    ...
    "something": { "key": XXX },
    ...
}
then gaedocstore loads the key referenced. If found, it looks in its list of transform names. If it finds one, it applies that transform to the loaded object, and puts the output into the stored GDSDocument. If no transform was found, then the entire object is put into the stored GDSDocument as described above.
eg:
Say we have the transform "address" as follows:
ltransform = {
    "fulladdr": "{{.addr1}}, {{.city}} {{.zipcode}}"
}
You can store this transform against the name "address" for gaedocstore to find as follows:
GDSDocument.StorebOTLTransform("address", ltransform)
Then when Person above is stored, it'll have its address placed inline as follows:
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "key": "1234567",
        "fulladdr": "2 thing st, somewheretown 12345"
    }
}
An analogous process happens to embedded addresses whenever the Address object is updated.
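To give a feel for what applying a transform does, here is a small stand-in written with a regular expression. It is not pybOTL and handles only flat "{{.field}}" placeholders like the one in the example; render_template and apply_transform are hypothetical names, and carrying the source "key" across into the result is an assumption based on the output shown above.

```python
import re

# Matches placeholders of the form {{.fieldname}}
_PLACEHOLDER = re.compile(r"\{\{\.(\w+)\}\}")


def render_template(template, source):
    """Replace each {{.field}} in template with the matching value from source."""
    def substitute(match):
        # Raises KeyError if the source object lacks the named field
        return str(source[match.group(1)])
    return _PLACEHOLDER.sub(substitute, template)


def apply_transform(transform, source):
    """Build the inline object: the source key plus each rendered template."""
    result = {"key": source["key"]}
    for name, template in transform.items():
        result[name] = render_template(template, source)
    return result


ltransform = {"fulladdr": "{{.addr1}}, {{.city}} {{.zipcode}}"}
laddress = {
    "key": "1234567",
    "type": "Address",
    "addr1": "2 thing st",
    "city": "somewheretown",
    "zipcode": 12345,
}
linline = apply_transform(ltransform, laddress)
```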
You can look up the bOTL transform with:
ltransform = GDSDocument.GetbOTLTransform("address")
and delete it with
GDSDocument.DeletebOTLTransform("address")
Desired feature (not yet implemented): If the template itself is updated, then all objects affected by that template are also updated.

Deletion

If an object is deleted, then all denormalized links will be updated with a special key "link_missing": True. For example, say we delete address "1234567". Then Person will become:
{
    "key": "897654",
    "type": "Person",
    "name": "Fred",
    "address": 
    {
        "key": "1234567",
        "link_missing": True
    }
}
And if the object is recreated in the future, then that linked data will be reinstated as expected.
Similarly, if an object is saved with a link, but the linked object can't be found, "link_missing": True will be included as above.
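The missing-link rule can be sketched as a single lookup on plain dictionaries. This is only an illustration: resolve_link is a hypothetical helper, and a dict stands in for the datastore.

```python
def resolve_link(reference, documents):
    """Return the inline copy for a {"key": ...} reference."""
    target = documents.get(reference["key"])
    if target is None:
        # Keep the key so the link can be reinstated later
        return {"key": reference["key"], "link_missing": True}
    return dict(target)


documents = {"1234567": {"key": "1234567", "type": "Address", "city": "stuffville"}}
reference = {"key": "1234567"}

before_delete = resolve_link(reference, documents)

deleted = documents.pop("1234567")             # delete the address
after_delete = resolve_link(reference, documents)

documents["1234567"] = deleted                 # recreate it later
after_recreate = resolve_link(reference, documents)
```

Because the placeholder keeps the original key, resolving the same reference after the object is recreated gives back the full linked data.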

Updating Denormalized Linked Data Back to Parents

The current version does not support this, but in a future version we may support the ability to change the denormalized information, and have it flow back to the original object. eg: you could change addr1 in address inside person, and it would fix the source address. Note this won't work when transforms are being used (you would need inverse transforms).

Storing Deltas

I've had a feature request from a friend, to have a mode that stores a version history of all changes to objects. I think it's a great idea. I'd like a strongly parsimonious feel for the library as a whole: it should just feel like "ndb with benefits".
