Requests in Python and MongoDB

If you use PyMongo, 10gen’s official MongoDB driver for Python, I want to ensure you understand how it manages sockets and threads, and I want to brag about performance improvements in PyMongo 2.2, which we plan to release next week.

The Problem: Threads and Sockets

Each PyMongo Connection object includes a connection pool (a pool of sockets) to minimize the cost of reconnecting. If you do two operations (e.g., two find()s) on a Connection, it creates a socket for the first find(), then reuses that socket for the second.

When sockets are returned to the pool, the pool checks if it has more than max_pool_size spare sockets, and if so, it closes the extra sockets. By default max_pool_size is 10.
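
Here is a minimal sketch of that return-to-pool rule. It is not PyMongo's actual code: TinyPool, FakeSocket, get_socket and return_socket are hypothetical names I made up for illustration, and only max_pool_size comes from the description above.

```python
class FakeSocket(object):
    """Stand-in for a real socket; only tracks whether it was closed."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class TinyPool(object):
    """Hypothetical pool: keeps at most max_pool_size spare sockets."""
    def __init__(self, max_pool_size=10):
        self.max_pool_size = max_pool_size
        self.spares = []

    def get_socket(self):
        # Reuse a spare socket if one exists, otherwise "connect" a new one.
        if self.spares:
            return self.spares.pop()
        return FakeSocket()

    def return_socket(self, sock):
        # Keep the socket only if the pool is not already full of spares.
        if len(self.spares) < self.max_pool_size:
            self.spares.append(sock)
        else:
            sock.close()


pool = TinyPool(max_pool_size=2)
socks = [pool.get_socket() for _ in range(3)]
for s in socks:
    pool.return_socket(s)

spare_count = len(pool.spares)
closed_count = sum(1 for s in socks if s.closed)
```

With a limit of 2, the third socket returned is the one that gets closed; the first two stay in the pool for reuse.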

What if multiple Python threads share a Connection? A possible implementation would be for each thread to get a random socket from the pool when needed, and return it when done. But consider the following code. It updates a count of visitors to a web page, then displays the number of visitors on that web page including this visit:

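import pymongo

connection = pymongo.Connection()
counts = connection.my_database.counts

# this_page_url() stands for whatever returns the current page's URL.
# Increment this page's counter, creating the document on first visit.
counts.update(
    {'_id': this_page_url()},
    {'$inc': {'n': 1}},
    upsert=True)

# Read the counter back; this should include the increment above.
n = counts.find_one({'_id': this_page_url()})['n']

print('You are visitor number %s' % n)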

Since PyMongo defaults to unsafe writes—that is, it does not ask the server to acknowledge its inserts and updates—it will send the update message to the server and then instantly send the find_one, then await the result. If PyMongo gave out sockets to threads at random, then the following sequence could occur:

  1. This thread gets a socket, which I’ll call socket 1, from the pool.
  2. The thread sends the update message to MongoDB on socket 1. The thread does not ask for nor await a response.
  3. The thread returns socket 1 to the pool.
  4. The thread asks for a socket again, and gets a different one: socket 2.
  5. The thread sends the find_one message to MongoDB on socket 2.
  6. MongoDB happens to read from socket 2 first, and executes the find_one.
  7. Finally, MongoDB reads the update message from socket 1 and executes it.

In this case, the count displayed to the visitor wouldn’t include this visit.
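
The sequence above can be simulated without a database. This is only a toy model under my own assumptions: each "socket" is a FIFO queue, the "server" is a dict, and the run function and its parameters are hypothetical names, not anything in PyMongo or MongoDB.

```python
from collections import deque


def run(server_read_order, use_one_socket):
    """Simulate an unacknowledged update followed by a find_one.

    Each "socket" is a FIFO queue of messages; the "server" drains the
    sockets in server_read_order and applies messages as it reads them.
    """
    sockets = {1: deque(), 2: deque()}
    counts = {'n': 0}      # the server's stored visitor count
    replies = []           # what find_one returns to the client

    # Client side: the update goes out on socket 1; the find_one goes out
    # on socket 1 again (one socket per thread) or on socket 2 (random).
    sockets[1].append('update')
    sockets[1 if use_one_socket else 2].append('find_one')

    # Server side: messages on a single socket are processed in order, but
    # nothing orders messages that arrive on different sockets.
    for sock_id in server_read_order:
        while sockets[sock_id]:
            message = sockets[sock_id].popleft()
            if message == 'update':
                counts['n'] += 1
            else:
                replies.append(counts['n'])
    return replies[0]


# The server happens to read socket 2 before socket 1 in both runs.
stale = run(server_read_order=[2, 1], use_one_socket=False)
fresh = run(server_read_order=[2, 1], use_one_socket=True)
```

With two sockets the find_one sees the count before the increment; with one socket the same server read order cannot reorder the two messages.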

I know what you’re thinking: just do the find_one first, add one to it, and display it to the user. Then send the update to MongoDB to increment the counter. Or use findAndModify to update the counter and get its new value in one round trip. Those are great solutions, but then I would have no excuse to explain requests to you.

Maybe you’re thinking of a different fix: use update(safe=True). That would work, as well, with the added advantage that you’d know if the update failed, for example because MongoDB’s disk is full, or you violated a unique index. But a safe update comes with a latency cost: you must send the update, wait for the acknowledgement, then send the find_one and wait for the response. In a tight loop the extra latency is significant.

The Fix: One Socket Per Thread

PyMongo solves this problem by automatically assigning a socket to each thread, when the thread first requests one. The socket is stored in a thread-local variable within the connection pool. Since MongoDB processes messages on any single socket in order, using a single socket per thread guarantees that in our example code, update is processed before find_one, so find_one’s result includes the current visit.
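
As a rough sketch of the idea, assuming nothing about PyMongo's real internals: the PerThreadPool class and its get_socket method below are hypothetical, and the "sockets" are just labelled strings.

```python
import threading


class PerThreadPool(object):
    """Hypothetical pool that pins one socket to each thread."""
    def __init__(self):
        self._local = threading.local()
        self._lock = threading.Lock()
        self.sockets_created = 0

    def get_socket(self):
        # The first request from a thread creates its socket; later
        # requests from the same thread get the same object back.
        sock = getattr(self._local, 'sock', None)
        if sock is None:
            with self._lock:
                self.sockets_created += 1
                sock = 'socket-%d' % self.sockets_created
            self._local.sock = sock
        return sock


pool = PerThreadPool()
seen = {}


def worker(name):
    # Two consecutive operations in one thread, like update then find_one.
    first = pool.get_socket()
    second = pool.get_socket()
    seen[name] = (first, second)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the three threads gets its own socket, and both of its operations travel over that same socket.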

More Awesome Connection Pooling

While PyMongo’s socket-per-thread behavior nicely resolves the inconsistency problem, there are some nasty performance costs that are fixed in the forthcoming PyMongo 2.2. (I did most of this work, at the direction of PyMongo’s maintainer Bernie Hackett and with co-brainstorming by my colleague Dan Crosta.)

Connection Churn

PyMongo 2.1 stores each thread’s socket in a thread-local variable. Alas, when the thread dies, its thread locals are garbage-collected and the socket is closed. This means that if you regularly create and destroy threads that access MongoDB, then you are regularly creating and destroying connections rather than reusing them.

You could call Connection.end_request() before the thread dies. end_request() returns the socket to the pool so it can be used by a future thread when it first needs a socket. But, just as most people don’t recycle their plastic bottles, most developers don’t use end_request(), so good sockets are wasted.

In PyMongo 2.2, I wrote a “socket reclamation” feature that notices when a thread has died without calling end_request, and reclaims its socket for the pool. Under the hood, I wrap each socket in a SocketInfo object, whose __del__ method returns the socket to the pool. For your application, this means that once you’ve created as many sockets as you need, those sockets can be reused as threads are created and destroyed over the lifetime of the application, saving you the latency cost of creating a new connection for each thread.
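
A stripped-down sketch of the reclamation trick follows. The SocketInfo name and its __del__ hook come from the description above; the Pool and FakeSocket classes, the attribute names, and everything else are simplifications I am assuming for illustration, and the real class surely has more to it.

```python
import gc


class FakeSocket(object):
    """Stand-in for a real socket."""


class Pool(object):
    """Hypothetical pool that just keeps a list of spare sockets."""
    def __init__(self):
        self.spares = []


class SocketInfo(object):
    """Wraps a socket; hands it back to the pool when garbage-collected."""
    def __init__(self, sock, pool):
        self.sock = sock
        self.pool = pool

    def __del__(self):
        # The wrapper is going away, e.g. because the thread-local that
        # held it was destroyed with its thread. The socket underneath is
        # still good, so return it to the pool rather than closing it.
        self.pool.spares.append(self.sock)


pool = Pool()
raw = FakeSocket()
info = SocketInfo(raw, pool)   # imagine a thread-local holds this
del info                       # imagine the thread has died
gc.collect()                   # CPython frees the wrapper at once;
                               # other interpreters may need a collection
reclaimed = pool.spares
```

The wrapper dies but the underlying socket survives in the pool, ready for the next thread that needs one.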

Total Number of Connections

Consider a web crawler that launches hundreds of threads. Each thread downloads pages from the Internet, analyzes them, and stores the results of that analysis in MongoDB. Only a couple threads access MongoDB at once, since they spend most of their time downloading pages, but PyMongo 2.1 must use a separate socket for each. In a big deployment, this could result in thousands of connections and a lot of overhead for the MongoDB server.

In PyMongo 2.2 we’ve added an auto_start_request option to the Connection constructor. It defaults to True, in which case PyMongo 2.2’s Connection acts the same as 2.1’s, except it reclaims sockets from dead threads. If you set auto_start_request to False, however, threads can freely and safely share sockets. The Connection will only create as many sockets as are actually used simultaneously. In our web crawler example, if you have a hundred threads but only a few of them are simultaneously accessing MongoDB, then only a few sockets are ever created.
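
To see why sharing needs so few sockets, here is a toy pool in which a socket is borrowed for one operation and then handed back. SharedPool, checkout and checkin are hypothetical names, the "sockets" are labelled strings, and the sketch ignores max_pool_size entirely.

```python
import threading


class SharedPool(object):
    """Hypothetical pool: a socket is held only for one operation."""
    def __init__(self):
        self._lock = threading.Lock()
        self._idle = []
        self.sockets_created = 0

    def checkout(self):
        # Hand out an idle socket if there is one; otherwise make a new one.
        with self._lock:
            if self._idle:
                return self._idle.pop()
            self.sockets_created += 1
            return 'socket-%d' % self.sockets_created

    def checkin(self, sock):
        with self._lock:
            self._idle.append(sock)


pool = SharedPool()

# One hundred workers that never overlap: each borrows a socket for a
# single operation and gives it straight back.
for _ in range(100):
    sock = pool.checkout()
    pool.checkin(sock)
created_without_overlap = pool.sockets_created

# Three operations in flight at the same moment need three sockets.
in_flight = [pool.checkout() for _ in range(3)]
for sock in in_flight:
    pool.checkin(sock)
created_with_overlap = pool.sockets_created
```

The number of sockets tracks the peak number of simultaneous operations, not the number of threads.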

start_request and end_request

If you create a Connection with auto_start_request=False you might still want to do some series of operations on a single socket for read-your-own-writes consistency. For that case I’ve provided an API that can be used three ways, in ascending order of convenience.

You can call start/end_request on the Connection object directly:

connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts
connection.start_request()
try:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
finally:
    connection.end_request()

The Request object

start_request() returns a Request object, so why not use it?

connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts
request = connection.start_request()
try:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
finally:
    request.end()

Using the Request object as a context manager

Request objects can be used as context managers in Python 2.5 and later (on 2.5 itself, put from __future__ import with_statement at the top of the module), so the previous example can be terser:

connection = pymongo.Connection(auto_start_request=False)
counts = connection.my_database.counts
with connection.start_request() as request:
    counts.update(
        {'_id': this_page_url()},
        {'$inc': {'n': 1}},
        upsert=True)

    n = counts.find_one({'_id': this_page_url()})['n']
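
If you are curious how an object like this can work as a context manager, here is a minimal sketch. It is not PyMongo's source: FakeConnection and its in_request flag are stand-ins I invented, and the Request class shows only the general __enter__/__exit__ pattern.

```python
class FakeConnection(object):
    """Stand-in for a Connection; only tracks whether a request is open."""
    def __init__(self):
        self.in_request = False

    def start_request(self):
        self.in_request = True
        return Request(self)

    def end_request(self):
        self.in_request = False


class Request(object):
    """Hypothetical Request: ends the request when the with-block exits."""
    def __init__(self, connection):
        self.connection = connection

    def end(self):
        self.connection.end_request()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end()
        return False   # never swallow an exception from the block


connection = FakeConnection()
states = []
try:
    with connection.start_request() as request:
        states.append(connection.in_request)
        raise ValueError('simulated failure inside the request')
except ValueError:
    pass
states.append(connection.in_request)
```

The request is open inside the block and closed afterwards, even though the block raised, which is the same guarantee the try/finally versions give.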

Proof

I wrote a very messy test script to verify the effect of my changes on the number of open sockets, and the total number of sockets created.

The script queries MongoDB for 60 seconds. It starts a thread each second for 40 seconds, each thread lasting for 20 seconds and doing 10 queries per second. So there’s a 20-second ramp-up until there are 20 threads, then 20 seconds of steady state with 20 concurrent threads (one dying and one created per second), then a 20-second cooldown until the last thread completes. My script then parses the MongoDB log to see when sockets were opened and closed.
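
The thread schedule is easy to check with a few lines of arithmetic. This is not the test script itself, just a sketch that assumes worker i starts exactly at second i and lives exactly 20 seconds; the names are mine.

```python
THREADS = 40       # one worker started per second
LIFETIME = 20      # seconds each worker lives


def workers_alive(t):
    """Number of workers alive at second t, if worker i starts at second i."""
    return sum(1 for start in range(THREADS) if start <= t < start + LIFETIME)


# Sample the number of live workers once per second for the whole run.
timeline = [workers_alive(t) for t in range(60)]
peak_workers = max(timeline)

# With one socket per thread and no reuse, every worker plus the main
# thread opens its own socket.
total_sockets_no_reuse = THREADS + 1
peak_sockets = peak_workers + 1
```

Under those assumptions the run peaks at 20 workers, so a socket-per-thread driver with no reuse opens 41 sockets in total and holds at most 21 at once.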

I tested the script with the current PyMongo 2.1, and also with PyMongo 2.2 with auto_start_request=True and with auto_start_request=False.

PyMongo 2.1 has one socket per thread throughout the test. Each new thread starts a new socket because old threads’ sockets are lost. It opens 41 total sockets (one for each worker thread plus one for the main) and tops out at 21 concurrent sockets, because there are 21 concurrent threads (counting the main thread):

[Figure: PyMongo 2.1]

PyMongo 2.2 with auto_start_request=True acts rather differently (and much better). It ramps up to 21 sockets and keeps them open throughout the test, reusing them for new threads when old threads die:

[Figure: PyMongo 2.2, auto_start_request=True]

And finally, with auto_start_request=False, PyMongo 2.2 only needs as many sockets as there are threads concurrently waiting for responses from MongoDB. In my test, this tops out at 7 sockets, which stay open until the whole pool is deleted, because 7 is under the max_pool_size of 10:

[Figure: PyMongo 2.2, auto_start_request=False]

Conclusion

Applications that create and destroy a lot of threads without calling end_request() should run significantly faster with PyMongo 2.2 because threads’ sockets are automatically reused after the threads die.

Although we had to default the new auto_start_request option to True for backwards compatibility, virtually all applications should set it to False. Heavily multithreaded apps will need far fewer sockets this way, meaning they’ll spend less time establishing connections to MongoDB, and put less load on the server.

Pausing with Tornado

Throwing this in my blog so I don’t forget again. The way to sleep for a certain period of time using tornado.gen is:

import time

import tornado.web
from tornado.ioloop import IOLoop
from tornado import gen

class MyHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    @gen.engine
    def get(self):
        self.write("sleeping .... ")
        # Do nothing for 5 sec
        loop = IOLoop.instance()
        yield gen.Task(loop.add_timeout, time.time() + 5)
        self.write("I'm awake!")
        self.finish()

Simple once you see it, but for some reason this has been the hardest for me to get used to.
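
What makes this kind of pause useful is that only the handler is suspended while the event loop keeps serving everyone else. Here is the same idea in the standard library's asyncio, purely as an analogue: it is not Tornado's API, it needs Python 3.7 or later, and the function names are made up.

```python
import asyncio

events = []


async def sleepy_handler():
    events.append('sleeping')
    # Suspend only this coroutine; the event loop keeps running others.
    await asyncio.sleep(0.05)
    events.append('awake')


async def other_work():
    # Runs while sleepy_handler is paused.
    events.append('other work ran')


async def main():
    await asyncio.gather(sleepy_handler(), other_work())


asyncio.run(main())
```

The other coroutine runs between "sleeping" and "awake", which a blocking time.sleep() would not allow.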

Review of “Version Control with Git” by Jon Loeliger

Git is the most powerful and conceptually elegant source code management system I’ve used. (Perhaps Mercurial rivals it? I haven’t used Mercurial.) But it seems to be in a state of arrested development. Many commands commonly used in ordinary development are basically unimplemented, and have to be performed with a set of lower-level commands. For example, publishing a local branch so remote developers can use it, and then setting up the branch so the remote copy continues to get updates, is a hard-to-memorize set of 3 commands, whereas it’s a no-brainer in Subversion.

My theory is that Linus Torvalds built the initial git as a set of low-level commands for managing versioned data in general, and intended higher-level, more convenient SCMs to be built on top of it. Since Linus had scratched his own itch, he left the higher-level implementation to others, but no one rose to the challenge. Now it’s too late—git is the default SCM for open-source projects, and so we’re stuck using low-level commands, or writing custom scripts for common tasks. (Or you can use Tower, like me.)

It’s as if we had reverted to programming in C. The newfound power is liberating, but it comes at a price. Whereas I have learned all previous SCMs casually (CVS, SVN, Perforce), learning git is like learning C. You won’t just pick it up. I’ve used it professionally for 4 years and I still flounder occasionally. To use it well, you have to understand it. You probably have to read a book.

Version Control with Git, by Jon Loeliger from 2009, is a good remedy. It introduces the reader to git’s object model (objects, trees, commits, refs, tags), and shows how git’s everyday commands use its “plumbing” commands to manipulate these basic materials. The book walks through detailed examples, including some pathologically-complex merges, and describes distributed development thoroughly.

If I have a nit to pick, it’s that the book’s discussion of distributed development is obsolete. In 2009, it may have been appropriate to spend a long chapter discussing how to email patches, and how to apply patches from an email. But these days, GitHub has obviated this process. In my experience, open-source developers, who need to review each other’s changes before applying them, use GitHub pull requests instead of git’s commands for managing patches over email. I hope a new edition will drastically cut the section on patches and add a discussion of GitHub’s collaborative features.

If you’re not ready to read a whole book, “Git from the bottom up” by John Wiegley provides some of the core concepts.

Review of Seamus Heaney’s translation of “Beowulf”

I’m going to gradually repost some reviews I wrote on Goodreads; hope you find them interesting.

I have three things to say about Beowulf.

1) The ballad itself is not only semi-foundational to English-language literature, but it’s short, bloody, and fast-paced. If you’re bored by Chaucer or Tolstoy or Proust or any of the other insufferably verbose classics you’re supposed to read, just read Beowulf. You’ll be free to say, “Fuck you, I’ve read Beowulf,” whenever you’re feeling insecure at the English majors’ party. Besides that, the core themes in Beowulf are subtle and beautiful: The need to gain honor while one lives, because death is always around the corner, and the inevitability of war, loss, and grief in a society where vengeance begets vengeance eternally.

(NB: I actually like Chaucer.)

2) Seamus Heaney’s translation is of variable accessibility and power. Sometimes he takes the Old English (which is a foreign language to us now) and translates awkwardly in order to preserve the alliterative structure of the original, as when Grendel attacks Beowulf:

… he was bearing in
with open claw when the alert hero’s
comeback and armlock forestalled him utterly.

I don’t know what metric or alliterative goals Heaney is trying to accomplish with these lines, but as an action sequence they’re failing.

Other times Heaney translates Old English into archaic English, using words like “torque”, “bawn”, or “thole” that are hardly more intelligible than the original.

But mostly, and redeemingly, Heaney writes with rough beauty and forthrightness, as when the poet reminds us that victory is always followed by sorrow:

Whoever remains
for long here in this earthly life
will enjoy and endure more than enough.

3) Most fantasy literature I know of owes some debt to Beowulf. Tolkien, for sure, with his strong handsome heroes, ancient monsters, and deceitful counselors sticks close to the vision of Beowulf. (Except the good guys win in Tolkien. In Beowulf, vengeance follows vengeance, forever and ever.)

Game of Thrones on HBO is very Beowulfy—conversations between characters begin with a recounting of the deeds of their fathers and grandfathers, with implicit comparisons to the present day.

Review of “Being Geek” by Michael Lopp

I’m always late to the party, but here’s my review of Michael Lopp’s 2010 book Being Geek.

A supposed career handbook, with little relevance to my career. The author has worked at large corporations (including Netscape) and small startups, but his idea of a small startup is 80 employees. In recent years, that’s my idea of a large company. He also assumes a kind of corporate culture that I hope is obsolete: The kind where you have a week to prepare for the Big Meeting, the kind where you live and die by PowerPoint. In my career, I never see slides.

Lopp advises the reader on job-searching, but it’s a style of search which I consider an illusion: You respond to a job listing with a resume carefully tweaked for the position, pass a phone screen, and interview on site for half a day. When I left college I, too, thought this was how people applied for jobs, but in my experience it’s a sign of desperation if you resort to such measures. If you want to apply to a dozen companies and get nowhere, submit your resume. If you want to work, email your friends. Lopp’s advice might apply to those fresh out of college, with no contacts, looking for long-term employment in giant corporations. I’ve never been such a person, and I never meet them either. If you’re freelancing, or working for a startup, then reading this book is useful not for its specific advice, but simply as an opportunity to spend a few hours considering your own situation. It’s nothing like his, but in considering the distinction you may clarify your status and your goals.

In Response to “Stop Looking For A Technical Co-founder”

Alexey Komissarouk, apparently a very savvy CS senior at UPenn, has written Stop Looking For A Technical Co-founder. He’s criticizing the phenomenon that I’ve found epidemic in NYC: some business-type, I’ll call him Mr. MBA, usually with no money or software-development expertise, has some idea and goes around looking for a “technical co-founder,” i.e., someone who will implement the idea.

It’s easy for programmers like us to make fun of Ms. MBA. One of the reasons is that many such people are naïve and useless. A bad MBA (the common breed) will wander from Meetup to Meetup searching for a technical co-founder and, in the best case, will find no takers. But I’ve worked in a number of startups, and a good businessperson has something to contribute to software startups: he or she will deal with budgeting and advertising and talking to investors, things we don’t want to do and aren’t good at. In the best cases, I’ve seen sharp businesspeople teamed with hackers into unstoppably awesome archons.

Komissarouk thinks that Ms. MBA should learn to code, or hire an external team, or simplify her software requirements by assembling most of it from existing services.

I don’t think any of these recommendations is sufficient to save the naïve aspiring software tycoon from calamity.

Learn to code. Sure, everyone should learn to code for the same reasons they should learn statistics and how a bill becomes a law. But it won’t save Mr. MBA. Unless he has a natural, undiscovered talent for coding, he will not be able to compete with the professionals. And unfortunately for him, he will be competing with them: if his idea is any good, then there are a hundred other people who have had the same idea at the same time. Many of those people are excellent software developers, or they know investors with money, or both. Consider the invention of the lightbulb, the telephone, and Facebook: these ideas arose in environments lousy with similar ideas; it’s just that Edison, Bell, and Zuckerberg were the smartest and luckiest among the people who had those ideas. For Mr. MBA, learning to code is better than not learning, but it won’t be enough to beat all the great coders who have already started working on the same idea.

Hire an external team. Assuming Ms. MBA has a lot of money somehow, she can just build a team to write the software. This is the best of the recommendations, because it dissuades Ms. MBA from her most common delusion: that her idea is worth about half the eventual company’s value, and so a good coder should join her and write all the code in exchange for equity. Komissarouk is right that someone with an idea who can’t implement it must be willing to pay someone else, in cash, now, to build it. The problem is, paying a team to build her idea puts Ms. MBA at a disadvantage competing with those who can code, who are working on that idea too: she has to spend more, and she can’t lead her team as well.

Reuse existing services. Yes, one should definitely use existing services where possible. But it’s not just business-types who can reuse services. Coders can play this game, too, as they sprint toward their launch. And they’ll do it more intelligently than Mr. MBA can do it.

What’s my recommendation? People shouldn’t try to found businesses in realms in which they lack expertise. If Mr. and Ms. MBA can’t code, chances are they won’t succeed in the software industry. Maybe they have deep knowledge about a problem in baseball or microlending that could be solved by software—in such a case the MBA’s expertise could be enough competitive advantage to succeed.

Or maybe Mr. and Ms. MBA don’t need to go into the software business at all. Maybe they know about farming, or solar power, or sewage—the world needs innovation in all those areas so much more than it needs another clone of Foursquare or Farmville. Please, business majors—if you’re not a very good software developer, turn your back on the software gold-rush. Learn about some tangible good that cries out for improvement, and go invent a better one.