A. Jesse Jiryu Davis

Category: Mongo

Wasp's Nest: The Read-Copy-Update Pattern In Python

In recent work on PyMongo, I used a concurrency-control pattern that solves a variety of reader-writer problem without mutexes. It's similar to the read-copy-update technique used extensively in the Linux kernel. I'm dubbing it the [...]

In recent work on PyMongo, I used a concurrency-control pattern that solves a variety of reader-writer problem without mutexes. It's similar to the read-copy-update technique used extensively in the Linux kernel. I'm dubbing it the Wasp's Nest. Stick with me—by the end of this post you'll know a neat concurrency pattern, and have a good understanding of how PyMongo handles replica set failovers.

Update: In this post's first version I didn't know how close my code is to "ready-copy-update". Robert Moore schooled me in the comments. I also named it "a lock-free concurrency pattern" and Steve Baptiste pointed out that I was using the term wrong. My algorithm merely solves a race condition without adding a mutex, it's not lock-free. I love this about blogging: in exchange for a little humility I get a serious education.


Paper Wasp © MzePhotos.com, Some Rights Reserved

The Mission

MongoDB is deployed in "replica sets" of identical database servers. A replica set has one primary server and several read-only secondary servers. Over time a replica set's state can change. For example, if the primary's cooling fans fail and it bursts into flames, a secondary takes over as primary a few seconds later. Or a sysadmin can add another server to the set, and once it's synced up it becomes a new secondary.

I help maintain PyMongo, the Python driver for MongoDB. Its MongoReplicaSetClient is charged with connecting to the members of a set and knowing when the set changes state. Replica sets and PyMongo must avoid any single points of failure in the face of unreliable servers and networks—we must never assume any particular members of the set are available.

Consider this very simplified sketch of a MongoReplicaSetClient:

class Member(object):
    """Represents one server in the set."""
    def __init__(self, pool):
        # The connection pool.
        self.pool = pool

class MongoReplicaSetClient(object):
    def __init__(self, seeds):
        self.primary = None
        self.members = {}
        self.refresh()

        # The monitor calls refresh() every 30 sec.
        self.monitor = MonitorThread(self)

    def refresh(self):
        # If we're already connected, use our list of known
        # members. Otherwise use the passed-in list of
        # possible members, the 'seeds'.
        seeds = self.members.keys() or self.seeds

        # Try seeds until first success.
        ismaster_response = None
        for seed in seeds:
            try:
                # The 'ismaster' command gets info
                # about the whole set.
                ismaster_response = call_ismaster(seed)
                break
            except socket.error:
                # Host down / unresolvable, try the next.
                pass

        if not ismaster_response:
            raise ConnectionFailure()

        # Now we can discover the whole replica set.
        for host in ismaster_response['hosts']:
            pool = ConnectionPool(host)
            member = Member(pool)
            self.members[host] = member

        # Remove down members from dict.
        for host in self.members.keys():
            if host not in ismaster_response['hosts']:
                self.members.pop(host)

        self.primary = ismaster_response.get('primary')

    def send_message(self, message):
        # Send an 'insert', 'update', or 'delete'
        # message to the primary.
        if not self.primary:
            self.refresh()

        member = self.members[self.primary]
        pool = member.pool
        try:
            send_message_with_pool(message, pool)
        except socket.error:
            self.primary = None
            raise AutoReconnect()

We don't know which members will be available when our application starts, so we pass a "seed list" of hostnames to the MongoReplicaSetClient. In refresh, the client tries them all until it can connect to one and run the isMaster command, which returns information about all the members in the replica set. The client then makes a connection-pool for each member and records which one is the primary.

Once refresh finishes, the client starts a MonitorThread which calls refresh again every 30 seconds. This ensures that if we add a secondary to the set it will be discovered soon and participate in load-balancing. If a secondary goes down, refresh removes it from self.members. In send_message, if we discover the primary's down, we raise an error and clear self.primary so we'll call refresh the next time send_message runs.

The Bugs

PyMongo 2.1 through 2.5 had two classes of concurrency bugs: race conditions and thundering herds.

The race condition is easy to see. Look at the expression self.members[self.primary] in send_message. If the monitor thread runs refresh and pops a member from self.members while an application thread is executing the dictionary lookup, the latter could get a KeyError. Indeed, that is exactly the bug report we received that prompted my whole investigation and this blog post.

The other bug causes a big waste of effort. Let's say the primary server bursts into flames. The client gets a socket error and clears self.primary. Then a bunch of application threads all call send_message at once. They all find that self.primary is None, and all call refresh. This is a duplication of work that only one thread need do. Depending how many processes and threads we have, it has the potential to create a connection storm in our replica set as a bunch of heavily-loaded applications lurch to the new primary. It also compounds the race condition because many threads are all modifying the shared state. I'm calling this duplicated work a thundering herd problem, although the official definition of thundering herd is a bit different.

Fixing With A Mutex

We know how to fix race conditions: let's add a mutex! We could lock around the whole body of refresh, and lock around the expression self.members[self.primary] in send_message. No thread sees members and primary in a half-updated state.

...and why it's not ideal

This solution has two problems. The first is minor: the slight cost of acquiring and releasing a lock for every message sent to MongoDB, especially since it means only one thread can run that section of send_message at a time. A reader-writer lock alleviates the contention by allowing many threads to run send_message as long as no thread is running refresh, in exchange for greater complexity and cost for the single-threaded case.

The worse problem is the behavior such a mutex would cause in a very heavily multithreaded application. While one thread is running refresh, all threads running send_message will queue on the mutex. If the load is heavy enough our application could fail while waiting for refresh, or could overwhelm MongoDB once they're all simultaneously unblocked. Better under most circumstances for send_message to fail fast, saying "I don't know who the primary is, and I'm not going to wait for refresh to tell me." Failing fast raises more errors but keeps the queues small.

The Wasp's Nest Pattern

There's a better way, one that requires no locks, is less error-prone, and fixes the thundering-herd problem too. Here's what I did for PyMongo 2.5.1, which we'll release next week.

First, all information about the replica set's state is pulled out of MongoReplicaSetClient and put into an RSState object:

class RSState(object):
    def __init__(self, members, primary):
        self.members = members
        self.primary = primary

MongoReplicaSetClient gets one RSState instance that it puts in self.rsstate. This instance is immutable: no thread is allowed to change the contents, only to make a modified copy. So if the primary goes down, refresh doesn't just set primary to None and pop its hostname from the members dict. Instead, it makes a deep copy of the RSState, and updates the copy. Finally, it replaces the old self.rsstate with the new one.

Each of the RSState's attributes must be immutable and cloneable, too, which requires a very different mindset. For example, I'd been tracking each member's ping time using a 5-sample moving average and updating it with a new sample like so:

class Member(object):
    def add_sample(self, ping_time):
        self.samples = self.samples[-4:]
        self.samples.append(ping_time)
        self.avg_ping = sum(self.samples) / len(self.samples)

But if Member is immutable, then adding a sample means cloning the whole Member and updating it. Like this:

class Member(object):
    def clone_with_sample(self, ping_time):
        # Make a new copy of 'samples'
        samples = self.samples[-4:] + [ping_time]
        return Member(samples)

Any method that needs to access self.rsstate more than once must protect itself against the state being replaced concurrently. It has to make a local copy of the reference. So the racy expression in send_message becomes:

rsstate = self.rsstate  # Copy reference.
member = rsstate.members[rsstate.primary]

Since the rsstate cannot be modified by another thread, send_message knows its local reference to the state is safe to read.

A few summers ago I was on a Zen retreat in a rural house. We had paper wasps building nests under the eaves. The wasps make their paper from a combination of chewed-up plant fiber and saliva. The nest hangs from a single skinny petiole. It's precarious, but it seems to protect the nest from ants who want to crawl in and eat the larvae. The queen periodically spreads an ant-repellant secretion around the petiole; its slenderness conserves her ant-repellant, and concentrates it in a small area.

Wasp's Nest in Situ [Source]

I think of the RSState like a wasp's nest: it's an intricate structure hanging off the MongoReplicaSetClient by a single attribute, self.rsstate. The slenderness of the connection protects send_message from race conditions, just as the thin petiole protects the nest from ants.

Since I was fixing the race condition I fixed the thundering herd as well. Only one thread should run refresh after a primary goes down, and all other threads should benefit from its labor. I nominated the monitor to be that one thread:

class MonitorThread(threading.Thread):
    def __init__(self, client):
        threading.Thread.__init__(self)
        self.client = weakref.proxy(client)
        self.event = threading.Event()
        self.refreshed = threading.Event()

    def schedule_refresh(self):
        """Refresh immediately."""
        self.refreshed.clear()
        self.event.set()

    def wait_for_refresh(self, timeout_seconds):
        """Block until refresh completes."""
        self.refreshed.wait(timeout_seconds)

    def run(self):
        while True:
            self.event.wait(timeout=30)
            self.event.clear()

            try:
                try:
                    self.client.refresh()
                finally:
                    self.refreshed.set()
            except AutoReconnect:
                pass
            except:
                # Client was garbage-collected.
                break

(The weakref proxy prevents a reference cycle and lets the thread die when the client is deleted. The weird try-finally syntax is necessary in Python 2.4.)

The monitor normally wakes every 30 seconds to notice changes in the set, like a new secondary being added. If send_message discovers that the primary is gone, it wakes the monitor early by signaling the event it's waiting on:

rsstate = self.rsstate
if not rsstate.primary:
    self.monitor.schedule_refresh()
    raise AutoReconnect()

No matter how many threads call schedule_refresh, the work is only done once.

Any MongoReplicaSetClient method that needs to block on refresh can wait for the "refreshed" event:

rsstate = self.rsstate
if not rsstate.primary:
    self.monitor.schedule_refresh()
    self.monitor.wait_for_refresh(timeout_seconds=5)

# Get the new state.
rsstate = self.rsstate
if not rsstate.primary:
    raise AutoReconnect()

# Proceed normally....

This pattern mitigates the connection storm from a heavily-loaded application discovering that the primary has changed: only the monitor thread goes looking for the new primary. The others can abort or wait.

The wasp's nest pattern is a simple and high-performance solution to some varieties of reader-writer problem. Compared to mutexes it's easy to understand, and most importantly it's easy to program correctly. For further reading see my notes in the source code.

Paper wasp and nest [Source]

Another Thing About Python's Threadlocals

As the maintainer of the connection pool for PyMongo, the official MongoDB driver for Python, I've gotten far more intimate knowledge of Python threads than I'd ever wanted. One of the challenges I face is: if the connect pool assigns a [...]

Dammit

As the maintainer of the connection pool for PyMongo, the official MongoDB driver for Python, I've gotten far more intimate knowledge of Python threads than I'd ever wanted.

One of the challenges I face is: if the connect pool assigns a socket to a thread and the thread dies, how do we reclaim the socket for the general pool? I thought I nailed it last year, using a weakref callback to a threadlocal, but there's a bug in that method. Justin Patrin of Idle Games discovered it while testing a PyMongo contribution he's making. I'm going to describe the bug, its impact, the cause, and the fix. I'll conclude by kvetching about supporting archaic versions of Python.

The Bug

Here's some code to start 1000 threads and register to be notified when they're kaput. At the end I assert no thread has died unmourned:

import threading
import weakref

nthreads = 10000
ncallbacks = 0
ncallbacks_lock = threading.Lock()
local = threading.local()
refs = set()

class Vigil(object):
    pass

def run():
    def on_thread_died(ref):
        global ncallbacks
        ncallbacks_lock.acquire()
        ncallbacks += 1
        ncallbacks_lock.release()

    vigil = Vigil()
    local.vigil = vigil
    refs.add(weakref.ref(vigil, on_thread_died))

threads = [threading.Thread(target=run)
           for _ in range(nthreads)]
for t in threads: t.start()
for t in threads: t.join()
getattr(local, 'c', None)  # Trigger cleanup in <= 2.7.0
assert ncallbacks == nthreads, \
    'only %d callbacks run' % ncallbacks

This is the method I presented in "Knowing When A Python Thread Has Died". Each thread creates a "vigil" object and sticks it in a threadlocal. Since only the threadlocal refers to the vigil, the vigil should be destroyed when the thread dies. I make a weakref to the vigil and register a weakref callback. If all goes well, the callback is run as the thread dies. A quirk of Python 2.7.0 or lesser is that the callback is run when the next thread accesses the threadlocal. This oddity is a consequence of Python Issue 1868, fixed by Antoine Pitrou in late 2010 and released in Python 2.7.1.

Note also that I synchronize ncallbacks += 1 with a mutex, since += is not atomic in Python. This innocent-looking mutex harbors a dark intent, as we shall soon discover.

In Python 2.7.1 and newer, the code above works as expected: ncallbacks is equal to 1000 immediately after all the threads are joined. In Python 2.7.0, ncallbacks should be 999 after the threads are joined, and then 1000 after the main thread does the final getattr to trigger cleanup.

The bug is: In Python 2.7.0 and older, ncallbacks is sometimes a few callbacks shy of a thousand. A few threads have been buried in unmarked graves....

Its Impact

I found that an application running Python 2.7.0 or older, if it creates and destroys very large numbers of threads continuously for a long time, and if each thread calls end_request at least once and start_request more times than end_request, will occasionally leave a socket tied to a dead thread. These sockets will eventually exceed the process's ulimit or MongoDB's.

This application pattern would be as weird and unusual as it sounds, which is why no one's complained of the bug.

The Fix

Once I'd written the test code above, I spent a few hours futzing with it—Dammit, I thought this worked! I tried various techniques to force Python 2.7.0 to run the callback a thousand times reliably. Late in the day a divine voice intoned, "synchronize assignment to the threadlocal." So I added a lock:

local_lock = threading.Lock()
# ...
    vigil = Vigil()
    local_lock.acquire()
    local.vigil = vigil
    local_lock.release()
    refs.add(weakref.ref(vigil, on_thread_died))

It worked! Now I was angrier. How can assigning to a threadlocal not be thread-safe?

The Cause

Let's again consider the example code above. The bytecode for assigning vigil to local.vigil is:

28 LOAD_FAST        1 (vigil)
31 LOAD_GLOBAL      3 (local)
34 STORE_ATTR       4 (vigil)

STORE_ATTR calls PyObject_SetAttr, which calls local_setattro, defined in Modules/threadmodule.c:

static int
local_setattro(localobject *self, PyObject *name, PyObject *v)
{
    PyObject *ldict;

    ldict = _ldict(self);
    if (ldict == NULL)
        return -1;

    return PyObject_GenericSetAttr((PyObject *)self, name, v);
}

At the highlighted line it calls _ldict. The _ldict function is, as I've known for some time, a pathetic piece of poo in Python 2.7.0 and older. Here's the turd, edited down a bit:

static PyObject *
_ldict(localobject *self)
{
    PyObject *tdict, *ldict;

    tdict = PyThreadState_GetDict();
    ldict = PyDict_GetItem(tdict, self->key);
    if (ldict == NULL) {
        ldict = PyDict_New(); /* we own ldict */

        PyDict_SetItem(tdict, self->key, ldict);
        Py_DECREF(ldict); /* now ldict is borrowed */
        if (i < 0)
            return NULL;

        Py_CLEAR(self->dict);
        Py_INCREF(ldict);
        self->dict = ldict; /* still borrowed */
    }

    /* The call to tp_init above may have caused
       another thread to run.
       Install our ldict again. */
    if (self->dict != ldict) {
        Py_CLEAR(self->dict);
        Py_INCREF(ldict);
        self->dict = ldict;
    }

    return ldict;
}

We haven't seen any use of the Py_BEGIN_ALLOW_THREADS macro, so one thread's had the GIL the whole time. Locking around the assignment shouldn't have any effect, right?

Well, take a look at the highlighted Py_CLEAR(self->dict) statement—there's the perpetrator. That statement gets the ldict of the last thread that accessed this threadlocal, swaps it with NULL and decrefs it. If this is the last reference to ldict (because the last thread has died) then decref'ing destroys it, and the weakref callback to vigil runs. The callback does ncallbacks_lock.acquire, which releases the GIL before trying to get the mutex.

So here's the kind of scenario I prevented by locking around assignment to the threadlocal:

  1. Thread A starts, assigns to the threadlocal, dies.
  2. Thread A's ldict is now the threadlocal's self->dict and has a refcount of 1.
  3. Thread B starts, begins assigning to the threadlocal, enters the _ldict function.
  4. _ldict sets self->dict to NULL and decrefs Thread A's ldict, which runs on_thread_died, which calls ncallbacks_lock.acquire and releases the GIL.
  5. Now Thread C starts, begins assigning to the threadlocal, enters _ldict.
  6. Thread C finds self->dict is NULL, increfs its own local ldict and assigns it to self->dict. It exits _ldict.
  7. Thread B resumes at Py_CLEAR(self->dict), increfs its own ldict and assigns it to self->dict.

Thread B has now replaced a pointer to Thread C's ldict with a pointer to its own, but it didn't decref Thread C's ldict first. (_ldict wasn't written to survive interruption during Py_CLEAR.) Thread C's ldict will never be destroyed, and a weakref callback to its vigil attribute will never be called.

Locking around assignment to the threadlocal prevents _ldict from running concurrently for any one threadlocal object, and prevents the refleak. In Python 2.7.1 and newer, the whole misguided self->dict system is removed from threadlocals and the lock's not needed.

This scenario applies to PyMongo's connection pool because the pool does need to acquire a lock in its weakref callback. Even if it didn't, there's a possibility of interruption whenever a thread is running Python code.

A Kvetch

This testing, the bug it revealed, the investigation, the fix: all this effort was spent to support entirely obsolete versions of Python. The Python core developers stopped maintaining them years ago, but PyMongo supports all Pythons going back to 2.4, mainly because there are "long-term support" Linux distros like Ubuntu and RHEL that once shipped with them. I have very savvy friends writing new applications on Python 2.6. Our children will have flying cars before we're done debugging these steam-powered versions of Python.

It's particularly frustrating because there's no point even filing bugs against Pythons before 2.7. "We fixed it," the developers will reply. "Upgrade." In Python 2.6, no one can hear you scream.

Review of "MongoDB Applied Design Patterns" by Rick Copeland

There's a lot of bad advice out there regarding MongoDB. As I wrote in my last review, even smart sources can encourage risky methods. Soon, I hope, there will be as much good MongoDB instruction from experts outside 10gen as there is good [...]

There's a lot of bad advice out there regarding MongoDB. As I wrote in my last review, even smart sources can encourage risky methods. Soon, I hope, there will be as much good MongoDB instruction from experts outside 10gen as there is good third-party SQL instruction. For now, know that you can trust Rick Copeland.

Copeland's new O'Reilly book on MongoDB complements O'Reilly's other five: the majestic Definitive Guide (due for a second edition in June), Scaling MongoDB, 50 Tips and Tricks, and the MongoDB books for Python and PHP.

After you've read the Definitive Guide, a good candidate for your second MongoDB book is Applied Design Patterns. (Disclosure: I was paid to critique an early draft.) Copeland's intended audience has basic MongoDB competence and wants application examples that optimize either for scalability or maintainability, plus the principles to guide new designs. Copeland also assumes basic SQL knowledge, and presents most examples in contrast to conventional SQL solutions, a method I find distracting and irrelevant. He identifies some common application types (product catalog, CMS, analytics, etc.) and provides for each a schema and application logic. He goes far beyond prior works when he discusses performance, consistency guarantees, and sharding considerations for every application.

MongoDB Applied Design Patterns

In Part 1, Copeland discusses the basic questions about MongoDB schemas. Right away, he identifies what makes nonrelational design different:

There is no longer a "garden path" of normalized database design to go down, and the go-to answer when faced with general schema design problems in MongoDB is "it depends".

MongoDB requires optimization up front, more often than SQL schema design does. (Armin Ronacher noticed this too a few months ago.) Most often the question is whether to embed or to link, and what data should be normalized or denormalized. Copeland uses an extensive description of disk seek times to explain the motivations for embedding and denormalization, better than prior MongoDB schema-design materials have.

Many presentations, my own included, have claimed that you can migrate your schema lazily with MongoDB: your application can start writing data in a new format, and read data in both new and old formats, while a batch job slowly migrates old data. MongoDB Applied Design Patterns finally presents a complete example of lazy migration, including example code (in Python) for reading data in both formats while the migration is in progress.

Without general-purpose transactions, MongoDB requires new techniques to guarantee that a series of changes is atomic: that is, to guarantee that in the long run your data either reflects all the changes or none of them. The simple approach is to put all related data in one document and use update operators to modify all the data in one shot. If there's no way to restrict your atomic operation to one document, your next best bet is optimistic concurrency control: try to complete the operation, check if another process overwrote your changes, and if so retry them. There are a number of examples of this in the wild (the MongoDB Manual, Dan Crosta, Scott Hernandez); Copeland's contribution is unusually complete, with example code for handling every case that can arise.

Part 2 of the book is much longer, and covers six kinds of application in depth, both conventional (a social network) and unusual (a role-playing game). Here Copeland excels. Where he covers well-tread ground his designs are more detailed and better thought out than prior authors', and where he innovates he chooses interesting problems to solve. In the Operational Intelligence chapter he explains compound indexes clearly and correctly. He presents a complete design for an analytics application using the MongoDB aggregation framework, and covers the interactions between aggregation, indexes, and sharding.

The final example of the book is an online Zork-style game. This is less widely applicable than E-Commerce or content management, but way more fun. Copeland chooses to radically denormalize his schema: when a player enters a room, the room's entire data structure is copied into the player's document so the game can display the player's state without querying for the room again. As with the other examples, this application is considered in depth: each query is carefully indexed, and when a player picks up an item, Copeland's code prevents another player from picking it up concurrently. Most of the game's intelligence is expressed in Python code rather than in MongoDB queries. Developers using Oracle or Microsoft SQL Server tend to push all the logic and complexity into their schema, their queries, and stored procedures. With MongoDB's simpler feature-set, coders have to move more logic out of the database and into their application. If a SQL refugee hasn't yet learned this lesson, the gaming chapter will drive it home.

Review of "Building Node Applications with MongoDB and Backbone"

Mike Wilson's O'Reilly book from December 2012 introduces some hip web development techniques by building a book-long example of a social networking app. Besides introducing MongoDB, Backbone, and Node, he shows the beauty and [...]

Mike Wilson's O'Reilly book from December 2012 introduces some hip web development techniques by building a book-long example of a social networking app. Besides introducing MongoDB, Backbone, and Node, he shows the beauty and remarkable concision of Jade, Require.js, and Mongoose. He demonstrates good patterns for organizing your code in an application of substantial complexity, covers a lot of ground in few pages, and concludes with an unusually feature-complete chat-server example that weaves together all the layers of the stack. Wilson has some dangerous habits readers shouldn't emulate, but on balance his book teaches well.

Building node applications

By necessity, the book jumps frequently between Node and Backbone, models and views, HTML and Javascript. It's the nature of web development that each new feature requires changes in many places, and it's hard to stay oriented. Wilson maintains a corrected version of each chapter's code on Github; use that instead of relying entirely on the examples in the book.

I've built one large front-end Javascript application with Backbone, and I floundered at organizing it. Although Backbone is rigorous (hence the name) about separating models and views, higher-level questions are underspecified: how should the code be split among files? Whose responsibility is it to create the models and views? Wilson uses Require.js to neatly slice code into files and to declare the dependencies among them. In his example application, the Backbone router is responsible for instantiating all models and views. As the book progresses and his example application grows, the routes, models, and views remain focused and decoupled. It's a compelling design. I wish I'd known.

Wilson spends an early chapter building a login system for his example app, before implementing any features. He even salts his password hashes to defend against rainbow tables. An author less secure in his convictions would fear losing his reader's attention, but Wilson insists on doing the right thing. And rightly so: readers will paste his examples and put them into production, so the examples should be complete.

On the other hand, Wilson's introduction to MongoDB misses some marks. It's only 12 pages, so why did he spend two of them on MapReduce? MapReduce has always been intended for big batch processes, not web applications. MongoDB books and talks have long over-emphasized MapReduce, which should be confined to a niche. The aggregation framework, on the other hand, is general-purpose and was released months before Wilson's book; it should have been covered instead.

Wilson also shows a MongoDB pattern that risks losing updates and is needlessly slow: When a user adds a contact in his social-networking site, Wilson's code fetches the whole user document, adds the contact, and saves the whole document back:

app.post('/accounts/:id/contact', function(req,res) {
  var accountId = req.params.id;
  var contactId = req.param('contactId', null);

  models.Account.findById(accountId, function(account) {
    models.Account.findById(contactId, function(contact) {
      models.Account.addContact(account, contact);
      account.save();
    });
  });

  // Note: Not in callback - this endpoint returns immediately and
  // processes in the background
  res.send(200);
});

(I've edited for brevity; the whole code is on GitHub.) Note that if two requests are updating the same account, the first one's updates are lost. $addToSet would have solved this, and would be more efficient too.

Equally worrisome is Wilson's tendency to drop errors on the floor instead of reporting them to the user, as shown at the bottom of this function. He argues "we are accepting the small but rare inconvenience in order to serve the majority of requests at an accelerated speed." This is a terrible argument for silencing errors, especially since the front-end framework needn't block the user from interacting with the UI while it waits for the server response.

A book like this seems intended to show best practices, and patterns that encourage correctness. Some of the hardest patterns to learn are error-handling in Node and concurrency control in MongoDB. I wish Wilson had devoted half the attention he placed on security to these two topics.

But I'm only mad at these flaws because the book they mar is a good one. As Wilson builds up his architecture piece by piece, the patterns appear both usable and elegant, and capable of staying clean as the app grows. Wilson uses Backbone custom named events like "app:loggedin" or "chat:start" to coordinate his front-end code, instead of letting views directly call methods on other views. A novice Backbone user might not see the tremendous value of decoupling views this way, but take it from me—it's a great idea.

The book concludes with a long chat example. Chat examples with Socket.io and Node are legion—indeed, obligatory—but the completeness of this one, including its integration with Backbone, is a tour de force. If you plan to use either Node or Backbone this book has excellent recommendations for structuring a large app, and even if you're not building with any of the frameworks Wilson covers, his examples can inspire you to write more concise and decoupled code.

Motor 0.1 Migration Instructions

Motor (which is indeed my non-blocking driver for MongoDB and Tornado) had a 0.1 release to PyPI yesterday. It had an odd history prior, so there are various versions of the code that you, dear reader, may have installed on your system. All [...]

Motor (which is indeed my non-blocking driver for MongoDB and Tornado) had a 0.1 release to PyPI yesterday. It had an odd history prior, so there are various versions of the code that you, dear reader, may have installed on your system. All you need to do is:

$ pip uninstall pymongo motor
$ pip install motor

Motor will pull in the official PyMongo, plus Tornado and Greenlet, as dependencies. You should now have Motor 0.1 and PyMongo 2.4.2:

>>> import pymongo
>>> pymongo.version
'2.4.2'
>>> import motor
>>> motor.version
'0.1'

(The lore is: I started Motor last year in a branch of my fork of PyMongo, so you could've installed an experimental version of both PyMongo and Motor from there. Then we transferred Motor into its own repo within the MongoDB.org organization on January 15. And on February 1st a zealous fan actually grabbed the "Motor" package name on PyPI and uploaded my code to it, then transferred ownership to me, just to make sure I could use the name Motor.)

Motor Officially Released

It's happened. Motor 0.1 is in PyPI. You can now install it with a simple: $ pip install motor This is the first official, production release of Motor. That said, there will be bugs: please file them and I'll respond as quickly as I can. Links: [...]

Motor

It's happened. Motor 0.1 is in PyPI. You can now install it with a simple:

$ pip install motor

This is the first official, production release of Motor.

That said, there will be bugs: please file them and I'll respond as quickly as I can.

Links:

Motor's Future

Motor is now feature-complete and fully tested. I expect to put it on the back burner and concentrate on other projects.

Motor will keep up easily with PyMongo development, because I designed it to. I don't intend for it to lag more than a smidge. For example, PyMongo 2.5 will bring some new security and authentication features; in the following Motor release I'll support those, too.

I believe this is the coolest thing I've ever made. I hope you have fun with it. Tweet me and let me know what you build with it.

A Curious Concurrency Case

Last month, the team in charge of 10gen's Ruby driver for MongoDB ran into a few concurrency bugs, reported by a customer running the driver in JRuby with a large number of threads and connections. I've barely written a line of Ruby in my [...]

Last month, the team in charge of 10gen's Ruby driver for MongoDB ran into a few concurrency bugs, reported by a customer running the driver in JRuby with a large number of threads and connections. I've barely written a line of Ruby in my life, but I jumped in to help for a week anyway.

I helped spot a very interesting performance bug in the driver's connection pool. The fix was easy, but thoroughly characterizing the bug turned out to be complex. Here's a record of my investigation.


The Ruby driver's pool assigns a socket to a thread when the thread first calls checkout, and that thread stays pinned to its socket for life. Until the pool reaches its configured max_size, each new thread has a bespoke socket created for it. Additional threads are assigned random existing sockets. When a thread next calls checkout, if its socket's in use (by another thread) the requesting thread waits in a queue.

Here's a simplified version of the pool:

class Pool
  def initialize(max_size)
    @max_size       = max_size
    @sockets        = []
    @checked_out    = []
    @thread_to_sock = {}
    @lock           = Mutex.new
    @queue          = ConditionVariable.new
  end

  # Check out an existing socket or create a
  # new socket if max_size not exceeded.
  # Otherwise, wait for the next socket.
  def checkout
    tid = Thread.current.object_id
    loop do
      @lock.synchronize do
        if sock = @thread_to_sock[tid]

          # Thread wants its prior socket
          if ! @checked_out.include?(sock)
            # Acquire the socket
            @checked_out << sock
            return sock
          end

        else

          if @sockets.size < @max_size

            # Assign new socket to thread
            sock = create_connection
            @thread_to_sock[tid] = sock
            return sock

          elsif @checked_out.size < @sockets.size

            # Assign random socket to thread
            sock = available[rand(available.length)]
            @thread_to_sock[tid] = sock
            return sock

          end

        end

        # Release lock, wait to try again
        @queue.wait(@lock)
      end
    end
  end

  # Return a socket to the pool.
  def checkin(socket)
    @lock.synchronize do
      @checked_out.delete(socket)
      @queue.signal
    end
  end
end

When a thread returns a socket, it signals the queue and wakes the next thread in line. That thread goes to the top of the loop and tries again to acquire its socket. The bug is in checkin: if the next thread in the queue is waiting for a different socket than the one just checked in, it may fail to acquire its socket, and it will sleep again.

When I first saw this I thought there must be the possibility of a deadlock. After all, if threads sometimes call checkin without really waking other threads, mustn't there come a time when everyone's waiting and no one has a socket?

I wrote a Python script to simulate the Ruby pool and ran it for a few thousand ticks, with various numbers of threads and sockets. It never deadlocked.

So I had to stop coding and start thinking.


Let's say there are N threads and S sockets. N can be greater than, less than, or equal to S. Doesn't matter. Assume the pool has already created all S sockets, and all N threads have sockets assigned. Each thread either:

  1. Has checked out its socket, and is going to return it and signal the queue, or
  2. Is waiting for its socket, or will ask for it in the future, or
  3. Has returned its socket and will never ask for it again.

To deadlock, all threads must be in state 2.

To reach that point, we need N - 1 threads in state 2 and have the Nth thread transition from 1 to 2. (By definition it doesn't go from state 3 to 2.) But when the Nth thread returns its socket and signals the queue, all sockets are now returned, so the next awakened thread won't wait again—its socket is available, so it goes to state 1. Thus, no deadlock.

The old code definitely wasn't efficient. It's easy to imagine cases where all a socket's threads were waiting, even though one of them could have been running. Let's say there are 2 sockets and 4 threads:

  1. Thread 1 has Socket A checked out, Thread 2 has Socket B, Thread 3 is waiting for A, Thread 4 is waiting for B, and they're enqueued like [3, 4].
  2. Thread 2 returns B, signals the queue.
  3. Thread 3 wakes, can't get A, waits again.

At this point, Thread 4 should be running, since its Socket B is available, but it's waiting erroneously for Thread 1 to return A before it wakes.

So we changed the code to do queue.broadcast instead of signal, so checkin wakes all the threads, and we released the fixed driver. In the future, even better code may prevent multiple threads from contending for the same socket at all.

The bugfix was obvious. It's much harder to determine exactly how bad the bug was—how common is it for a socket to be unused?


In my simulated pool there are 10 sockets. Each thread uses its socket for 1‑20 seconds, sleeps one second, and asks for its socket again. I counted how many sockets were in use each second, and subtracted that from S * total_time to get an inefficiency factor:

Percentage unused sockets

If N=S=10, threads never wait but there's some fake "inefficiency" due to the 1-second sleep. For larger numbers of threads the sleep time becomes irrelevant (because there's always another thread ready to use the socket), but signal adds an inefficiency that declines very slowly from 8% as the number of threads increases. A pool that uses broadcast, in contrast, can saturate its sockets if it has more than 30 threads.

I spent hours (mostly on planes) trying to determine why the inefficiency factor acts this way—why 8%? Shouldn't it be worse? And why does it fall, slowly, as N rises? But I'm calling it quits now. Leave a comment if you have any insights, but I'm satisfied that the old pool was wasteful and that the new one is a substantial improvement.

What It's Like To Work For 10gen

My colleague Kristina Chodorow wrote a post on working at 10gen with which I was a bit obsessed when I applied for a gig here in late 2011. Recently I've had an eventful couple weeks: I went to Miami, won a robot battle, and helped the Ruby [...]

My colleague Kristina Chodorow wrote a post on working at 10gen with which I was a bit obsessed when I applied for a gig here in late 2011. Recently I've had an eventful couple weeks: I went to Miami, won a robot battle, and helped the Ruby driver team fix some bugs. Seems like a good time to add to the "Working at 10gen" genre.


10gen, the MongoDB company, is as distributed as our database. We're spread around four continents, partly because we hired interesting people wherever they were, and partly to support our customers in their regions and time zones. Once a year everyone comes together, this time in Miami. My room had an acceptable vista:

IMG 0333

There was a hot tub on my balcony but I was the only one so equipped. Please don't tell my colleagues.

At our annual meetings we have nerdy contests. Last year, rocket building. This year, Lego robots that we programmed to push each other off a table.

The basic robot kit comes programmed to turn in a circle, and when it detects something in front of it, it charges. The 10gen teams came up with designs a little more sophisticated. Most bots had color-sensors pointed at the table's surface in front of them, so they could turn away from the edge without falling off. By far the cleverest hack I saw was this:

Bot

Photo: Gary Murakami

The robot has a black stripe taped to its front and rear shovels, which slips under the opponent's color-sensor. The opponent sees the stripe, thinks it's about to fall off the table, and turns away—and falls off the table.

Despite its brilliance, this subterfuge lost to my team's strategy. (Disclosure: I was doing customer support when they built the bot, so I take no credit.) We had a slow, powerful machine with a high shovel in front. Because it was high, the shovel tilted our opponents upward off their treads and robbed them of traction. It wasn't a very smart robot, but mechanics and brute force won the prize.

The kits were quite expensive, I hear. We tried not to bang them up too badly, or lose any parts, so we could donate them to a local middle school.

It's this combination of enjoyment with care-taking that I loved about the Miami meeting. We play like kids but we are not children. We never forgot that we're spending other people's money when we meet: aside from a few hours of robot-fighting, we spent our two and a half days in Miami holed up in conference rooms planning our future. Our event planner Samantha made it clear that we were not to spend any extra time in Miami on the company's dime. If we weren't on a plane home within a few hours of its ending, our expenses were our own. It sounds harsh but it's a mature attitude: we must take the greatest care with the capital entrusted to us.

The final day of our meeting, the Ruby driver team had a crisis. A customer reported that the driver was leaving cursors unclosed on replica-set members, because it sent the killCursors message to the wrong member. The Ruby team is normally four coders: Tyler Brock, Gary Murakami, Emily Stolfo, and Brandon Black. But Bernie Hackett and I from the Python team, and Jeff Yemin from Java, joined to take a look. Ruby Team Plus dug into the customer's reports and I learned that the driver was in a novel operating environment, and it was not thriving there. It was running in JRuby with a big thread pool, which exposed threading issues that had lain dormant for months. Not only was it leaving cursors open, but the JRuby BSON extension had concurrency bugs, and there was a strange performance degradation in the connection pool. I spent my time looking for connection-pool bugs and found a neat puzzler: I'm going to write about that in my next post.

Ruby Team Plus formed in a conference room in Miami, and drifted apart after we nailed down the last of the bugs by video-chat, nine intense days later. (Much more intense for the core Rubyists than for me.) No manager said, "You guys should help the Ruby team." We thought we could be useful, so we helped. I admire this about 10gen. I also admire that we piled in to help a customer without checking whether they had some Special Diamond Premium contract. They'd found bugs we needed to fix, so we worked nights and weekends to fix them. (And once they were fixed, we made it up to ourselves with some time off.)

That's what it's like to work for 10gen. I'm proud of us, and I'm having the time of my life.

Miami beach

Photo: Gary Murakami again

Motor Is Growing Up

For a long time I've thought that Motor, my non-blocking Python driver for MongoDB and Tornado, ought to be included as a module within the standard PyMongo package. Everyone both inside and outside 10gen has told me they'd prefer Motor be [ ... ]

Motor

For a long time I've thought that Motor, my non-blocking Python driver for MongoDB and Tornado, ought to be included as a module within the standard PyMongo package. Everyone both inside and outside 10gen has told me they'd prefer Motor be a separate distribution. Last week, I was suddenly enlightened. I agree!

(My argument for keeping Motor and PyMongo together was that changes in PyMongo might require changes in Motor, so they should be versioned and released together. But as Motor nears completion and I see the exact extent of its coupling with PyMongo, the risk of incompatibilities arising seems lower to me than it had.)

We completed the first step of the separation yesterday: We released PyMongo 2.4.2, the first version of PyMongo that includes the hooks Motor needs to wrap it and make it non-blocking.

The next step is to make a standalone distribution of Motor, and that's almost done, too. Motor has left its parent's house. It has:

And now, installing Motor is finally normal:

$ git clone git://github.com/mongodb/motor.git
$ cd motor
$ python setup.py install

Motor's not done yet, but it's heading to a 0.1 release in PyPI, as a standalone package, real soon now.