How to bulk delete entries in App Engine's Datastore

This is a very common problem everyone new to the Datastore faces sooner rather than later: you've got a specific entity kind and you want to remove multiple entries from it. Unfortunately, there's no DELETE FROM Entity instruction, so the problem is a little more complicated than it seems.

There are multiple options that you'll need to evaluate to make sure you choose the most appropriate one. Below, I'll explore each of the possible scenarios and solutions to give you enough options to choose from:

Do you really need to delete those entries?

First of all, I want you to ask yourself if you really need to delete the information. It might be cheaper to just keep it. It might be cheaper to soft-delete it by adding an "Archived" field instead. It all depends on multiple factors, but you should spend some time thinking about it.

Deleting is expensive, especially if you are deleting entries represented in multiple indexes. The cheapest way to delete something is by leaving it alone.

So don't jump to conclusions right away, and try to think about what would happen if you decided not to remove the entries at all.
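
In the Datastore, that soft delete would just be a boolean property you flip instead of a real delete. Here's a framework-free sketch of the same idea (the Record class and every name in it are mine, not an App Engine API):

```python
class Record(object):
    """Minimal stand-in for an entity with a soft-delete flag."""

    def __init__(self, name):
        self.name = name
        self.archived = False  # the "Archived" field from the text

def archive(record):
    # One property write instead of a delete; the entry stays around.
    record.archived = True

def active(records):
    # Queries simply filter on the flag instead of relying on deletion.
    return [r for r in records if not r.archived]
```

An ndb model would express the same thing with a BooleanProperty and a filtered query.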

Special case: Deleting all entries by hand

Before getting into more complicated procedures, if all you want to do is remove all existing entries by hand, you can use the Datastore Admin interface provided by Google:

  1. Enable the Datastore Admin
  2. Navigate to the Datastore Admin tab in the old appengine.google.com console.
  3. Select the entity kind you want to remove, and click on the Delete Entities button.

Note: At the time of this writing, the old App Engine console still exists, but Google is migrating everything to the new console. The Datastore Admin feature only exists in the old console, but that will hopefully change. I'll make sure to update this post when (if) that happens.

Deleting entries using the Remote API

If you want a little bit more flexibility than removing all entries for a given kind, you may want to consider the Remote API (Python, Java).

The Remote API provides an interactive shell for you to execute Datastore commands locally. Something like:

>>> from google.appengine.ext import db
>>> entries = Entry.all(keys_only=True)  # Entry is your db.Model subclass
>>> db.delete(entries)

The above code will select all keys for the Entry kind and delete them. Instead of running from a file in App Engine, the code will be running directly on your local computer, and the Remote API will take care of executing each command in the remote Datastore.

Deleting entries programmatically - The simplest approach

Of course, removing entries manually is very easy, but things start getting complicated if you want to remove them programmatically.

If you are looking at just a few entries, you may get away by doing something like this:

db.delete(Entry.all(keys_only=True))

This is exactly the same code we used in the Remote API example above, but put together on a single line. This is very likely the simplest approach you can follow to remove entries of a given kind in code: load the keys, then delete them.
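
One caveat worth knowing: Datastore batch calls are capped (500 entities per call, if memory serves; treat that number as an assumption and check the current quotas), so for larger key lists you'd want to split the call into chunks. A plain-Python sketch of that split (the chunks helper is a name I made up):

```python
def chunks(items, size=500):
    """Yield successive slices of at most `size` items.

    500 is the assumed per-call cap for Datastore batch operations;
    adjust it to whatever limit your environment actually enforces.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage sketch (not run here): delete the keys batch by batch.
# for batch in chunks(keys):
#     db.delete(batch)
```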

Deleting entries in multiple batches

Unfortunately, the method described above breaks as soon as we need to get rid of a larger set of entries. The Datastore imposes a 30-second deadline, which means that we need to come up with a different solution for scenarios involving bigger data sets.

A well-understood approach to do this is to remove the data in multiple batches. For this you'd use two things: Tasks (Java, Python) and Cursors (Java, Python).

Let's say that you already identified that you can easily remove 1,000 entries at a time without hitting the Datastore's 30-second deadline. Using a cursor you can limit your Datastore queries to 1,000 entries at a time, and using a Task Queue you can distribute multiple operations over time to avoid hitting another App Engine deadline: the 60-second per-request limit.

import webapp2

from google.appengine.ext import ndb
from google.appengine.api import taskqueue

class Task(webapp2.RequestHandler):
    def post(self):
        # Resume from where the previous task left off, if a
        # bookmark was passed along.
        cursor = None
        bookmark = self.request.get('bookmark')
        if bookmark:
            cursor = ndb.Cursor.from_websafe_string(bookmark)

        # Fetch the next batch of up to 1,000 keys.
        keys, next_cursor, more = Entry.query().fetch_page(
            1000,
            keys_only=True,
            start_cursor=cursor)

        ndb.delete_multi(keys)

        # Schedule the next batch only if there are entries left;
        # otherwise the chain of tasks stops here.
        if more:
            taskqueue.add(
                url='/task',
                params={'bookmark': next_cursor.to_websafe_string()})

The above code shows the implementation of a Task that removes 1,000 entries at a time from the Datastore. Note how this task schedules itself again at the end of the method, passing a "bookmark" as an argument so the next execution starts from where the previous one left off.
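
To make the bookmark flow easier to follow without an App Engine runtime, here's a plain-Python simulation of the same loop. fetch_page_fake only mimics fetch_page's (results, cursor, more) return shape; every name in it is a stand-in, not a real API:

```python
def fetch_page_fake(keys, page_size, cursor):
    """Mimic fetch_page: return (page, next_cursor, more)."""
    start = cursor or 0
    page = keys[start:start + page_size]
    next_cursor = start + len(page)
    return page, next_cursor, next_cursor < len(keys)

def delete_all(keys, page_size=1000):
    """Drain the keys page by page, carrying the cursor forward."""
    deleted, cursor, more = [], None, True
    while more:
        page, cursor, more = fetch_page_fake(keys, page_size, cursor)
        deleted.extend(page)  # on App Engine: ndb.delete_multi(page)
    return deleted
```

The real task does exactly this, except that each loop iteration runs as a separate task in the queue.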

Deleting entries using MapReduce

Although the above solution works fine, if you are planning to remove a large number of entries, I'd recommend looking into MapReduce:

MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running jobs that cannot be handled within the scope of a single request (...)

By using MapReduce, you can delegate to the framework all the plumbing required to run your solution in parallel, and concentrate only on deleting the entries. The resulting code will be clearer, and the processing will be distributed optimally across App Engine.
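
As a rough illustration of the model itself (this is not the App Engine MapReduce library; the sharding and worker pool below are my own stand-ins), the idea is to split the key space into shards, let each worker handle its shard independently, and combine the per-shard results:

```python
from concurrent.futures import ThreadPoolExecutor

def delete_shard(shard):
    # Map step: each worker handles its own slice of keys.
    # A real job would issue the deletes here; we just report the count.
    return len(shard)

def parallel_delete(keys, num_shards=4):
    # Split the key space into roughly equal shards.
    shards = [keys[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        counts = pool.map(delete_shard, shards)
    # Reduce step: combine per-shard results into a total.
    return sum(counts)
```

The framework handles the sharding, retries, and progress tracking for you; that's precisely the plumbing you get to skip writing.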

Spreading the costs over time

Before finishing this post, I want to make sure you keep something in mind: deleting is an expensive operation, so depending on how many entries you want to remove, you may be looking at a very large bill at the end of the month.

One solution, when applicable, is to spread the costs over time by removing a fixed number of entries every day. Here you can take advantage of the Datastore's free quotas, or simply spread the costs to avoid paying a huge one-time lump sum.
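
The arithmetic here is straightforward; a tiny helper (the name and the figures in the comment are made up for illustration) tells you how many days a fixed daily batch needs to drain a kind:

```python
def days_to_drain(total_entries, per_day):
    """Days needed to delete total_entries at per_day entries per day."""
    if total_entries <= 0:
        return 0
    return -(-total_entries // per_day)  # ceiling division

# e.g. 10 million entries at 50,000 deletes a day -> 200 days
```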

As long as you don't jump into code right away and properly evaluate your options beforehand, you should be fine. The Datastore is extremely powerful, but it gets complicated as soon as we start manipulating large amounts of data, so it requires a more careful approach than what some of us are used to.
