An introduction to data modeling using Google's Datastore

I've been there just like you.

I come from the relational world like probably everyone who's been doing software for the last 5 years or so. In 2012 NoSQL databases were already a thing, and I had an opportunity to work with Google's Datastore, so I googled the documentation and started from the beginning.

Holy moly!

A couple of paragraphs in and I realized that the Datastore is a different beast. Every person I've talked to remembers having a hard time properly understanding all the concepts, and on top of that, the existing information on the Internet is scarce and spread all over the place.

After going through all of it, I wanted something for dummies and lazybutts. For people with no time to read the same paragraph over and over to make sense of it. For people like me, basically.

So I've organized this post in the following sections. Feel free to skip to the appropriate place whenever you want, but I strongly recommend reading the whole thing from the beginning, at least the first time:

Entities

An entity is an object in the Datastore. Think of it as something similar to a tuple (also known as "row") in a relational world. An entity represents a particular element like the following book:

Programming Google App Engine. Dan Sanderson. Paperback. Oct 26, 2012

You can have another book entity:

Python for Google App Engine. Massimiliano Pippi. 184 pages.

And another one:

Google Compute Engine. Marc Cohen. Paperback. 246 pages.

Each one of these three entities represents a book and has properties that describe it.

Entity Kinds

To bring some organization to our entities we have Entity Kinds. These represent the type of your entities. In our example above the kind is book (of course, you can name it however you like.)

(Contrary to what most people tend to think, a kind is not comparable at all to a relational table. Please, don't go that way and start thinking more about objects and less about tables.)

The kind of an entity categorizes it for the purpose of queries. For example, we already have books, and we'll later talk about publishers. Each one of these represents a different entity (object) kind (type).
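To make "categorizes it for the purpose of queries" a bit more concrete, here's a tiny sketch in plain Python. It is not the Datastore API: the `entities` list and the `query_kind` helper are names I made up for illustration.

```python
# In-memory stand-in for the Datastore; every entity carries its kind.
entities = [
    {"kind": "book", "title": "Programming Google App Engine"},
    {"kind": "book", "title": "Python for Google App Engine"},
    {"kind": "book", "title": "Google Compute Engine"},
    {"kind": "publisher", "name": "O'Reilly Media"},
]


def query_kind(kind):
    """Return every entity of the given kind (hypothetical helper)."""
    return [entity for entity in entities if entity["kind"] == kind]


print(len(query_kind("book")))       # 3
print(len(query_kind("publisher")))  # 1
```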

Properties

The properties of an entity are just like the regular fields you already know from the relational world (or "columns", if you wish.)

Programming Google App Engine. Dan Sanderson. Paperback. Oct 26, 2012

From the above entity, we can easily discover the following properties: title, author, cover, and publication date.

Python for Google App Engine. Massimiliano Pippi. 184 pages.

This one has title, author, and number of pages.

Google Compute Engine. Marc Cohen. Paperback. 246 pages.

The last entity has title, author, cover, and number of pages.

Notice how each entity has slightly different properties. This is a key concept of the Datastore: it is schemaless, meaning entities of the same kind can each have a different set of properties.

Properties have a type, like String, Integer, Date, etc. (Python, Java). They can have single or multiple values (what!?), they can be indexed, sorted, and pretty much work like you already imagine.
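If it helps, you can picture the three books as plain Python dictionaries. This is only a mental model, not how you'd talk to the real Datastore, and the property names (`cover`, `pages`, `publication_date`) and the `property_names` helper are my own choices.

```python
from datetime import date

# Three entities of the same kind, each with a different set of properties.
books = [
    {"kind": "book", "title": "Programming Google App Engine",
     "author": "Dan Sanderson", "cover": "Paperback",
     "publication_date": date(2012, 10, 26)},
    {"kind": "book", "title": "Python for Google App Engine",
     "author": "Massimiliano Pippi", "pages": 184},
    {"kind": "book", "title": "Google Compute Engine",
     "author": "Marc Cohen", "cover": "Paperback", "pages": 246},
]


def property_names(entity):
    """Return the sorted property names of an entity, leaving out its kind."""
    return sorted(name for name in entity if name != "kind")


for book in books:
    print(book["title"], "->", property_names(book))
```

Nothing forces the second book to have a `cover`, and nothing stops the first one from having a `publication_date` the others lack.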

Multi-valued properties

People usually freak out when they hear about this, but yes, you can have multi-valued properties in the Datastore.

So far we've only talked about properties with a single value, but what happens if there's another book in our collection written by more than one author?

In a relational world we'd probably need another table and a relationship. Here we just need to make author a multi-valued property (and maybe start using the plural authors instead):
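Here's a rough sketch of the idea in plain Python, using a list as the multi-valued property. It is not the Datastore API. The `written_by` helper is hypothetical, and the two author names for Google BigQuery Analytics come from the book's cover rather than from anything else in this post. The helper mimics how an equality filter on a multi-valued property matches when any one of its values matches.

```python
books = [
    {"title": "Google Compute Engine", "authors": ["Marc Cohen"]},
    {"title": "Google BigQuery Analytics",
     "authors": ["Jordan Tigani", "Siddartha Naidu"]},
]


def written_by(author):
    """Titles of the books where any value of `authors` equals `author`."""
    return [book["title"] for book in books if author in book["authors"]]


print(written_by("Siddartha Naidu"))  # ['Google BigQuery Analytics']
print(written_by("Marc Cohen"))       # ['Google Compute Engine']
```

No join table, no extra relationship: the list lives right on the entity.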

Keys

So far we've been talking about four different book entities, but how can we uniquely identify each one of these?

Just like in the relational world, every entity in the Datastore will have a key that will identify the entity.

A numeric key will be auto-generated by the Datastore if one is not provided, or we can take on the responsibility and always specify a key when a new entity is created (either numeric or string).

For our book examples, let's assume our key is the isbn of the book:

(As you'll see later in this post, the composition of a key is a little bit more complicated than what you see above, but for this example, thinking of the key as the isbn of the book is good enough.)
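The two ways of getting a key can be sketched like this. It's a toy in-memory model, not the Datastore API: `store` and `put` are hypothetical names, and the simple counter is an assumption on my part (the real Datastore doesn't promise sequential numeric IDs).

```python
import itertools

_next_id = itertools.count(1)  # toy ID generator; real IDs are not sequential
store = {}


def put(kind, properties, identifier=None):
    """Save an entity and return its key as a (kind, identifier) pair."""
    if identifier is None:
        identifier = next(_next_id)  # auto-generated numeric identifier
    key = (kind, identifier)
    store[key] = dict(properties)
    return key


# We take on the responsibility and use the isbn as a string identifier.
isbn_key = put("book", {"title": "Programming Google App Engine"},
               identifier="978-1449398262")

# We let the (toy) datastore pick a numeric identifier for us.
auto_key = put("book", {"title": "Python for Google App Engine"})

print(isbn_key)  # ('book', '978-1449398262')
print(auto_key)  # ('book', 1)
```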

Relationships

Let's say we also want to keep track of the publisher for each book. How about creating entities of a new kind named publisher?

We can then go back to each book and add a reference property pointing to the appropriate publisher (I'm omitting the rest of the fields for clarity):

isbn: 978-1449398262 
title: Programming Google App Engine 
publisher: oreilly
...

isbn: 978-1784398194 
title: Python for Google App Engine
publisher: packt
...

isbn: 978-1449360887 
title: Google Compute Engine
publisher: oreilly
...

isbn: 978-1118824825 
title: Google BigQuery Analytics
publisher: wiley
...

(I don't think "reference property" exists as a formal concept. Both Python and Java use different terminology to represent relationships, but for our purposes, a reference property is a concept similar to a foreign key in a relational database.)

You can also create a relationship from the publisher to the book using a multi-valued reference property books:

id: oreilly
name: O'Reilly Media
website: www.oreilly.com
books: 978-1449398262, 978-1449360887

id: packt
name: Packt Publishing
website: www.packtpub.com
books: 978-1784398194

id: wiley
name: Wiley
website: www.wiley.com
books: 978-1118824825

No matter how you decide to create the relationships, here is the full illustration with all our entities:

As you can see, so far the way to create relationships is very similar to the one you already know, with the addition of multi-valued properties, which make things a little bit more interesting.
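Both directions of the relationship can be followed with a simple lookup. The sketch below uses plain dictionaries keyed by identifier. `publisher_of` and `books_of` are hypothetical helpers, not Datastore calls, and I'm only including two of the publishers to keep it short.

```python
publishers = {
    "oreilly": {"name": "O'Reilly Media",
                "books": ["978-1449398262", "978-1449360887"]},
    "packt": {"name": "Packt Publishing", "books": ["978-1784398194"]},
}

books = {
    "978-1449398262": {"title": "Programming Google App Engine",
                       "publisher": "oreilly"},
    "978-1449360887": {"title": "Google Compute Engine",
                       "publisher": "oreilly"},
    "978-1784398194": {"title": "Python for Google App Engine",
                       "publisher": "packt"},
}


def publisher_of(isbn):
    """Follow the book's reference property to its publisher entity."""
    return publishers[books[isbn]["publisher"]]


def books_of(publisher_id):
    """Follow the publisher's multi-valued reference property to its books."""
    return [books[isbn]["title"] for isbn in publishers[publisher_id]["books"]]


print(publisher_of("978-1784398194")["name"])  # Packt Publishing
print(books_of("oreilly"))
```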

Consistency

At this point we have a total of 7 entities in our datastore: 4 book entities and 3 publisher entities. Do you know how the Datastore will physically store this information?

Google's Datastore is all about scaling. One way to accomplish this is by sharding the information across different distributed computers. To illustrate this point, we might end up with one of our books living in a Data Center located in Virginia, while its publisher might be stored in San Francisco.

What happens when we query for a book together with its publisher if they are stored on different computers? As you can imagine, the Datastore needs to find a way to replicate all the information across all Data Centers so we can access it from anywhere, but this process takes time.

Changing the publisher information in San Francisco means that we need to wait for a replication process to complete before the new data is available to query from Virginia.

This causes what we know as stale results: information that's not up-to-date when retrieved from the Datastore. This is probably one of the biggest challenges when using Google's Datastore.
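A toy simulation makes the problem easier to see. This is a deliberately simplified model with one primary copy and one replica, not a description of how the Datastore is actually built. The `ReplicatedStore` class and its methods are hypothetical, and the new website value is made up.

```python
class ReplicatedStore:
    """Writes land on the primary; the replica catches up only on replicate()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_from_replica(self, key):
        return self.replica.get(key)


store = ReplicatedStore()
key = "publisher:oreilly"

store.write(key, {"website": "www.oreilly.com"})
store.replicate()

store.write(key, {"website": "oreilly.com"})  # update not replicated yet
stale = store.read_from_replica(key)          # still sees the old value
store.replicate()
fresh = store.read_from_replica(key)

print(stale["website"])  # www.oreilly.com
print(fresh["website"])  # oreilly.com
```

The read that happens between the write and the replication is the stale result.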

Entity Groups

When I first read about having to deal with stale data, I hated it so much that I closed the page and didn't want to talk about it for a week.

You are probably feeling the same way. Especially because this is probably not a problem you have to deal with in the relational world.

Sure, there are times when stale information is not a huge deal, but there are cases when you absolutely can't afford to have this problem. How can you keep using the Datastore and forget about this glaring issue?

Well, luckily for us, there's another way: when designing the structure of your data, you can tell the Datastore what information is closely related and should be stored in the same place. This way, when those entities are returned together (check ancestor queries), you'll never get stale data.

These clusters of entities are called "Entity Groups".

So going back to our example, books should be closely related to their publishers. Anytime you return a book, it's very likely that you'd also want to return its publisher. This means that we should create an Entity Group with book and publisher.

An Entity Group will create a parent-child relationship between both entities:

Note how we ended up with three Entity Groups. Now, anytime that the Datastore decides to shard our entities, it will always keep together members belonging to the same Entity Group.
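You can check the count of three with a quick sketch: make each publisher the parent of its books and group the books by parent. This is plain Python with a hypothetical `entity_groups` helper, not something the Datastore exposes.

```python
from collections import defaultdict

# Parent (publisher) of each book, taken from the examples in this post.
parent_of = {
    "978-1449398262": "oreilly",
    "978-1449360887": "oreilly",
    "978-1784398194": "packt",
    "978-1118824825": "wiley",
}


def entity_groups(parents):
    """Group children under their root entity (hypothetical helper)."""
    groups = defaultdict(list)
    for child, parent in parents.items():
        groups[parent].append(child)
    return dict(groups)


groups = entity_groups(parent_of)
print(len(groups))        # 3
print(groups["oreilly"])  # ['978-1449398262', '978-1449360887']
```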

Ancestors

In Datastore lingo, the parent-child relationship to create Entity Groups is referred to as an ancestor relationship. The parent, parent's parent, and so on are ancestors, while the children, grandchildren, and so on, are descendants.

Any entity without a parent is called a root entity.

Why am I telling you this? Well, in order to refer to any entity in the Datastore, you need to do it through its ancestor path:

[publisher:oreilly, book:978-1449398262]
[publisher:oreilly, book:978-1449360887]
[publisher:packt, book:978-1784398194]
[publisher:wiley, book:978-1118824825]

The ancestor path is also referred to as the key of an entity (now is probably a good time to re-read the Keys section from earlier). For a root entity, the key will be just the entity's kind and identifier:

[publisher:wiley]
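One way to picture these keys is as tuples of (kind, identifier) pairs, where an ancestor is simply a prefix of the path. The sketch below is my own model in plain Python, and `is_root`, `root_of`, and `descendants` are hypothetical helpers. `descendants` is a rough imitation of what an ancestor query gives you.

```python
OREILLY = (("publisher", "oreilly"),)

keys = [
    OREILLY,
    OREILLY + (("book", "978-1449398262"),),
    OREILLY + (("book", "978-1449360887"),),
    (("publisher", "packt"), ("book", "978-1784398194")),
    (("publisher", "wiley"), ("book", "978-1118824825")),
]


def is_root(key):
    """A root entity's key holds only its own kind and identifier."""
    return len(key) == 1


def root_of(key):
    """The first element of the path identifies the Entity Group."""
    return key[:1]


def descendants(ancestor, all_keys):
    """Keys whose path starts with `ancestor` (excluding the ancestor itself)."""
    size = len(ancestor)
    return [key for key in all_keys
            if len(key) > size and key[:size] == ancestor]


print(is_root(OREILLY))                 # True
print(len(descendants(OREILLY, keys)))  # 2
print(root_of(keys[4]))                 # (('publisher', 'wiley'),)
```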

Write Throughput

The first time I read about the Datastore I quickly realized that Entity Groups are good to avoid returning stale data, so why not create one big Entity Group with all the entities of the system? That will make all the queries 100% consistent, right?

Not so fast.

Entity Groups have the limitation of creating datastore contention when you try to update them too rapidly:

The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception. If you're expecting to update a single entity or write to an entity group more than several times per second, it's best to re-work your design early-on to avoid possible contention once your application is deployed.

So although Entity Groups are good for consistency, they are bad for write throughput. You have to find the proper balance for your own application.
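To get a feel for the trade-off, here's a rough simulation. It assumes each Entity Group accepts at most one write per `MIN_INTERVAL` seconds and rejects anything faster. Both the one-second figure and the immediate rejection are simplifications I picked for illustration (the real Datastore queues and retries first), and `count_rejected` is a hypothetical helper.

```python
MIN_INTERVAL = 1.0  # assumed seconds between writes to one Entity Group


def count_rejected(writes, min_interval=MIN_INTERVAL):
    """Count writes that arrive too soon after the last accepted write
    to the same Entity Group. `writes` is a list of (timestamp, group)."""
    last_accepted = {}
    rejected = 0
    for timestamp, group in writes:
        previous = last_accepted.get(group)
        if previous is not None and timestamp - previous < min_interval:
            rejected += 1
        else:
            last_accepted[group] = timestamp
    return rejected


# The same three writes, modeled two different ways.
one_big_group = [(0.0, "everything"), (0.2, "everything"), (0.4, "everything")]
per_publisher = [(0.0, "oreilly"), (0.2, "packt"), (0.4, "wiley")]

print(count_rejected(one_big_group))  # 2
print(count_rejected(per_publisher))  # 0
```

With one giant group every write competes with every other write. With one group per publisher, only writes touching the same publisher get in each other's way.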

What comes next?

Of course, modeling for the Datastore doesn't stop here. You've also got indexes (Java, Python), and have to deal with transactions (Java, Python), and queries (Java, Python), and all sorts of interesting stuff.

But hopefully this post gave you an introduction to the main concepts of the Datastore and how these work in tandem to create your data model.

If you are interested in going further, here are some pages I'd recommend:

Happy modeling!
