Undoing Gevent’s monkey-patching

Update

I’m a genius: simply executing reload(socket) undoes Gevent’s patch_socket(). Obviously, this only applies to new sockets created after executing reload, but that’s good enough for my unittests. The dumb solution below is preserved for hysterical porpoises.
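Here is a minimal sketch of the reload trick, assuming Python 3, where reload lives in importlib (on Python 2 it is a builtin). To keep the example free of a Gevent dependency, the patch is simulated with a hypothetical FakeSocket class instead of gevent.monkey.patch_socket(). Reloading should behave the same way after a real patch, since it re-executes the socket module and rebinds its names.

```python
import importlib
import socket


class FakeSocket(object):
    """Hypothetical stand-in for a patched socket class (not Gevent's)."""


def simulate_patch():
    # Mimic a monkey-patch: overwrite an attribute on the socket module.
    socket.socket = FakeSocket


def undo_patch():
    # Re-executing the module's code rebinds its names to fresh originals.
    importlib.reload(socket)


simulate_patch()
was_patched = socket.socket is FakeSocket

undo_patch()
is_restored = socket.socket is not FakeSocket
```

Code that grabbed a reference to the patched class before the reload keeps that reference, which is the same caveat as above: only sockets created afterwards are affected.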

Prior

I ran into an odd problem while testing the next release of PyMongo, the Python driver for MongoDB which I help develop. We’re improving its support for Gevent, so we’re of course doing additional tests that begin with:

from gevent import monkey; monkey.patch_socket()

Now, some tests rely on this patching, and some rely on not being patched. Gevent doesn’t provide an unpatch_socket, so I had a clever idea: I’ll fork a subprocess with multiprocessing, do the test there, and return its result to the parent process in a multiprocessing.Value. Then subsequent tests won’t be affected by the patching.

import multiprocessing
import unittest

SUCCESS = 1
FAILURE = 0

def my_test(outcome):
    from gevent import monkey; monkey.patch_socket()
    # do the test ....
    outcome.value = SUCCESS

class Test(unittest.TestCase):
    def test(self):
        outcome = multiprocessing.Value('i', FAILURE)
        # start() returns None, so keep a reference to the process to join() it
        process = multiprocessing.Process(target=my_test, args=(outcome,))
        process.start()
        process.join()
        self.assertEqual(SUCCESS, outcome.value)

Nice and straightforward, right? In sane operating systems this worked great. On Windows it broke horribly. When I did python setup.py test, instead of executing my_test(), multiprocessing on Windows restarted the whole test suite, which started another whole test suite, … Apparently, since Windows can’t fork(), multiprocessing re-imports your script and attempts to execute the proper function within it. If the test suite is begun with python setup.py test, then everything goes haywire. This problem with multiprocessing and unittests on Windows was discussed on the Python mailing list last February.

After some gloomy minutes, I decided to look at what patch_socket() is doing. Turns out it’s simple, so I wrote a version which allows unpatching:

def patch_socket(aggressive=True):
    """Like gevent.monkey.patch_socket(), but stores old socket attributes for
    unpatching.
    """

    from gevent import socket
    _socket = __import__('socket')

    old_attrs = {}
    for attr in (
        'socket', 'SocketType', 'create_connection', 'socketpair', 'fromfd'
    ):
        if hasattr(_socket, attr):
            old_attrs[attr] = getattr(_socket, attr)
            setattr(_socket, attr, getattr(socket, attr))

    try:
        from gevent.socket import ssl, sslerror
        old_attrs['ssl'] = _socket.ssl
        _socket.ssl = ssl
        old_attrs['sslerror'] = _socket.sslerror
        _socket.sslerror = sslerror
    except ImportError:
        if aggressive:
            try:
                del _socket.ssl
            except AttributeError:
                pass

    return old_attrs


def unpatch_socket(old_attrs):
    """Take output of patch_socket() and undo patching."""
    _socket = __import__('socket')

    for attr in old_attrs:
        if hasattr(_socket, attr):
            setattr(_socket, attr, old_attrs[attr])


def patch_dns():
    """Like gevent.monkey.patch_dns(), but stores old socket attributes for
    unpatching.
    """

    from gevent.socket import gethostbyname, getaddrinfo
    _socket = __import__('socket')

    old_attrs = {}
    old_attrs['getaddrinfo'] = _socket.getaddrinfo
    _socket.getaddrinfo = getaddrinfo
    old_attrs['gethostbyname'] = _socket.gethostbyname
    _socket.gethostbyname = gethostbyname

    return old_attrs


def unpatch_dns(old_attrs):
    """Take output of patch_dns() and undo patching."""
    _socket = __import__('socket')

    for attr in old_attrs:
        setattr(_socket, attr, old_attrs[attr])

In Gevent’s version, calling patch_socket() calls patch_dns() implicitly; in mine you must call both:

class Test(unittest.TestCase):
    def test(self):
        old_socket_attrs = patch_socket()
        old_dns_attrs = patch_dns()

        try:
            pass  # do test ...
        finally:
            unpatch_dns(old_dns_attrs)
            unpatch_socket(old_socket_attrs)

Now I don’t need multiprocessing at all.
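The save-and-restore pattern above can also be wrapped in a context manager, so the try/finally lives in one place. This is only a sketch: patched_attrs and fake_gethostbyname are hypothetical names, not part of Gevent or PyMongo, and the example swaps in a dummy resolver so it runs without Gevent installed.

```python
import contextlib
import socket


@contextlib.contextmanager
def patched_attrs(module, **replacements):
    """Hypothetical helper: temporarily replace attributes on a module."""
    old_attrs = {}
    try:
        for name, value in replacements.items():
            old_attrs[name] = getattr(module, name)
            setattr(module, name, value)
        yield old_attrs
    finally:
        # Restore whatever was saved, even if the body raised.
        for name, value in old_attrs.items():
            setattr(module, name, value)


def fake_gethostbyname(hostname):
    """Stand-in for a patched resolver; always answers with loopback."""
    return '127.0.0.1'


original = socket.gethostbyname
with patched_attrs(socket, gethostbyname=fake_gethostbyname):
    inside = socket.gethostbyname('example.invalid')
restored = socket.gethostbyname is original
```

A test would put its body inside the with block, and the original attributes come back when the block exits.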

Tornado Unittesting With Generators

Intro

This is the second installment of what is becoming an ongoing series on unittesting in Tornado, the Python asynchronous web framework.

A couple months ago I shared some code called assertEventuallyEqual, which tests that Tornado asynchronous processes eventually arrive at the expected result. Today I’ll talk about Tornado’s generator interface and how to write even pithier unittests.

Late last year Tornado gained the “gen” module, which allows you to write async code in a synchronous-looking style by making your request handler into a generator. Go look at the Tornado documentation for the gen module.

I’ve extended that idea to unittest methods by making a test decorator called async_test_engine. Let’s look at the classic way of testing Tornado code first, then I’ll show a unittest using my new method.

Classic Tornado Testing

Here’s some code that tests AsyncMongo, bit.ly’s MongoDB driver for Tornado, using a typical Tornado testing style:

    def test_stuff(self):
        db = asyncmongo.Client(
            pool_id='test_query',
            host='127.0.0.1',
            port=27017,
            dbname='test',
            mincached=3
        )

        def cb(result, error):
            self.stop((result, error))

        db.collection.remove(safe=True, callback=cb)
        self.wait()
        db.collection.insert({"_id" : 1}, safe=True, callback=cb)
        self.wait()

        # Verify the document was inserted
        db.collection.find(callback=cb)
        result, error = self.wait()
        self.assertEqual([{'_id': 1}], result)

        # MongoDB has a unique index on _id
        db.collection.insert({"_id" : 1}, safe=True, callback=cb)
        result, error = self.wait()
        self.assertTrue(isinstance(error, asyncmongo.errors.IntegrityError))

Full code in this gist. This is the style of testing shown in the docs for Tornado’s testing module.

Tornado Testing With Generators

Here’s the same test, rewritten using my async_test_engine decorator:

    @async_test_engine(timeout_sec=2)
    def test_stuff(self):
        db = asyncmongo.Client(
            pool_id='test_query',
            host='127.0.0.1',
            port=27017,
            dbname='test',
            mincached=3
        )

        yield gen.Task(db.collection.remove, safe=True)
        yield gen.Task(db.collection.insert, {"_id" : 1}, safe=True)

        # Verify the document was inserted
        yield AssertEqual([{'_id': 1}], db.collection.find)

        # MongoDB has a unique index on _id
        yield AssertRaises(
              asyncmongo.errors.IntegrityError,
              db.collection.insert, {"_id" : 1}, safe=True)

A few things to note about this code: First is its brevity. Most operations and assertions about their outcomes can coëxist on a single line.

Next, look at the @async_test_engine decorator. This is my subclass of the Tornado-provided gen.engine. Its main difference is that it starts the IOLoop before running this test method, and it stops the IOLoop when this method completes. By default it fails a test that takes more than 5 seconds, but the timeout is configurable.

Within the test method itself, the first two operations use remove to clear the MongoDB collection, and insert to add one document. For both those operations I use yield gen.Task, from the tornado.gen module, to pause this test method (which is a generator) until the operation has completed.

Next is a class I wrote, AssertEqual, which inherits from gen.Task. The expression

 yield AssertEqual(expected_value, function, arguments, ...)

pauses this method until the async operation completes and calls the implicit callback. AssertEqual then compares the callback’s argument to the expected value, and fails the test if they’re different.

Finally, look at AssertRaises. This runs the async operation, but instead of examining the result passed to the callback, it examines the error passed to the callback, and checks that it’s the expected Exception.
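To make the mechanism concrete, here is a rough sketch of the idea with Tornado taken out. It is not the code from the gist and not how tornado.gen is implemented: Task, run_test, fake_find, and fake_insert are hypothetical stand-ins, callbacks fire synchronously, and there is no IOLoop or timeout. It only shows how a driver can resume a generator each time a yielded operation calls back, and check the (result, error) pair on the way.

```python
class Task(object):
    """Hypothetical stand-in for gen.Task: a call that takes a callback."""

    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def start(self, callback):
        self.func(*self.args, callback=callback, **self.kwargs)

    def check(self, result, error):
        pass  # A plain Task asserts nothing.


class AssertEqual(Task):
    def __init__(self, expected, func, *args, **kwargs):
        Task.__init__(self, func, *args, **kwargs)
        self.expected = expected

    def check(self, result, error):
        assert error is None, error
        assert result == self.expected, '%r != %r' % (self.expected, result)


class AssertRaises(Task):
    def __init__(self, exc_class, func, *args, **kwargs):
        Task.__init__(self, func, *args, **kwargs)
        self.exc_class = exc_class

    def check(self, result, error):
        assert isinstance(error, self.exc_class), '%r not raised' % self.exc_class


def run_test(generator):
    """Resume the generator each time the task it yielded calls back."""
    finished = []

    def step(value=None):
        try:
            task = generator.send(value)
        except StopIteration:
            finished.append(True)
            return

        def callback(result, error):
            task.check(result, error)
            step((result, error))

        task.start(callback)

    step()
    return bool(finished)


# Fake async operations that follow the (result, error) callback convention.
def fake_find(callback):
    callback([{'_id': 1}], None)


def fake_insert(doc, callback):
    callback(None, KeyError('duplicate _id'))


def example_test():
    yield Task(fake_find)
    yield AssertEqual([{'_id': 1}], fake_find)
    yield AssertRaises(KeyError, fake_insert, {'_id': 1})


passed = run_test(example_test())
```

In the real decorator the callbacks arrive later from the IOLoop, but the control flow is the same: the test method sleeps at each yield until its operation reports back.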

Full code for async_test_engine, AssertEqual, and AssertRaises is in this gist. The code relies on AsyncMongo’s convention of passing (result, error) to each callback, so I invite you to generalize the code for your own purposes. Let me know what you do with it; I feel like there’s a place in the world for an elegant Tornado test framework.

Video, Slides, and Code About Async Python and MongoDB

Video is now online from my webinar last week about Tornado and MongoDB. Alas, I didn’t make the text on my screen big enough to be easily readable in the low-res video we recorded, so it’ll be a little fuzzy for you. (Live and learn.) No worries, the slides are here in full-res glory and the example code is on GitHub. It’s a trivial Twitter clone called “chirp” which demonstrates using a MongoDB capped collection as a sort of queue. The demo uses Tornado, a MongoDB tailable cursor, and socket.io to stream new “chirps” from the capped collection to clients. I’ve implemented the same demo app three times:

Generosity

I’m taking the 10-month Path of Practice class at the Village Zendo. It’s based on the Ten Paramitas, or “Perfections,” a list of qualities that Buddhists should encourage in themselves, so we’re more useful to others and grow our wisdom.

We’re starting with Dana Paramita, the virtue of Generosity (same root as “donation”), and I’ll share my reflections on it here.

•••

I’m most generous to things, not to people. The consequences of my work may benefit people, but the way it feels to me, I’m motivated to improve or fix or create a thing. If that makes life better for others that’s great, but it isn’t the reason I do it.

I work as a programmer for 10gen, a startup developing database software. Part of my work is writing code, and part of it is providing support for customers. Early this week a customer complained to us that some records in their database had become corrupted and couldn’t be parsed, probably because of some transient hardware problem. The problem wouldn’t recur, but they really wanted those dozen records repaired. They have hundreds of thousands of users, and hundreds of millions of records, but these dozen records were broken and the customer wanted them fixed.

It probably would have been ok to say, “Sorry, those records are gone.” Or at least, “We’ll see if we can recover them some time soon.” But I worked until 11 that night, and started again the next morning, diving into each record and examining it bit by bit, finding the 1s that should have been 0s and the 0s that should have been 1s. I had a sense of urgency, and irritation, that the data could be fixed, but I hadn’t done it yet. There is no describing my relief when I was finished. It’s one of the most satisfying things I’ve done.

I ran into my teacher, Enkyo Roshi, while I was buying lunch at Whole Foods. I described what I’d spent the last 8 hours doing and she said, “You’re really deep in there. It must be like a body.” It was like a body. I had to feel my way through the numbers.

That work was a great generosity, but it didn’t occur to me at the time that I was being generous. And it probably didn’t look like generosity. At the moments when I was giving the most, I was simultaneously drinking wine, playing techno, fixing the bits, and cursing the customer directly over instant messenger. (He’s an old friend.) It didn’t look like generosity because my compassion wasn’t toward the customer, it was toward the data itself.

•••

When my teacher named me Jiryu, she explained to me what the Chinese characters meant. “Ji” is maintaining, or fixing, and “Ryu” is a flow or a canal. She said my name connotes the person who maintains the irrigation canals in a rice field. I love this. Sometimes people say my name means “healing flow,” but that sounds hippy and sentimental to me, and sort of menstrual. “Healing” is not inspiring to me, not like “fixing” is. I want to be a fixer.

•••

My friend Eisho gave a talk at the Zendo tonight about this koan:

Yunyan asked Daowu, “How does the Bodhisattva Guanyin use those many hands and eyes?”

Daowu answered, “It is like someone in the middle of the night reaching behind her head for the pillow.”

Yunyan said, “I understand.”

Daowu asked, “How do you understand it?”

Yunyan said, “All over the body are hands and eyes.”

Daowu said, “That is very well expressed, but it is only eight-tenths of the answer.”

Yunyan said, “How would you say it, elder brother?”

Daowu said, “Throughout the body are hands and eyes.”

—Blue Cliff Record, Case 89
Translated by Joan Sutherland and John Tarrant

“Hands and eyes” describes Avalokiteshvara, the bodhisattva of compassion, whose hundreds of hands each has an eye on the palm, so he or she can see the suffering of all beings, and respond. The point of the koan, in my humble opinion, is two-fold: first, that the most effective generosity is an immediate response to a need, the way we adjust a pillow when we’re uncomfortable. It’s not like signing up for a blood drive so I feel like a good person. When the pillow’s out of place, there’s an urgency and irritation about fixing it now. The other point about the body is deeper and I will not try to put it in words tonight, or any time soon.

•••

I think there’s a wide variation in what motivates people to be generous. Some people are probably satisfied by seeing a need in other people and fulfilling it. It’s less so for me. I’ve certainly done generous things for people this month, like helping a friend move, or paying for dinner, or meditating with prisoners at Sing Sing, but that kind of generosity isn’t the strongest urge for me and it’s not where I spend most of my time. Rather, it’s when I have an idea that I want to make real, or when something’s broken that I can fix, that I work the hardest and longest.

Career Fairs, Part 2: How Can Startups Get Noticed?

I wrote the other day about what I think Comp Sci majors are doing wrong at career fairs and how they should be distinguishing themselves from their peers. There’s a fun debate in the comments about whether I gave the right advice. Regardless, here’s a followup question I need answered from CS undergrads:

If you’ve been to a career fair, what did startups do wrong? How can we get you to notice us?

When I consider how we at 10gen set ourselves up at Big Ivy University’s career fair, I don’t think we did any better than the students did. We displayed our logo and the name of our product MongoDB, and … that’s it. I can’t blame the hundred kids who came up to our table and said, “What do you do?” We should say why an internship with us will be awesome, e.g.:

  • The NoSQL movement is one of the most innovative areas in software these days, and we dominate it.
  • We’re small, so if you’re smart you can make a big relative contribution.
  • We’re run by and for coders: Our CEO codes, I code, our customers code, everyone codes.
  • We’re on the kind of growth trajectory that eventually makes household-name companies.

I don’t know how to say these things convincingly, especially not on a poster, so that a smart undergrad who’s never heard of us will stop at our table. Suggestions?

So You’re Coming to a Career Fair

I went to a career fair at Big Ivy University recently, and talked to fifty or so computer science undergrads who were looking for internships or full-time jobs with my employer, 10gen. I’m sure some of them were very smart, but they had not learned how to distinguish themselves from each other. One after another, these students came with identical resumes, identical suits, and identical pitches about why they should get a gig with us.

CS students, I want to tell you how to stand out when you’re introducing yourself at a career fair. If you’re an extraordinary hacker, you need to tell us that you are, and you need to show that you are on your resume. Otherwise we can’t find you.

What You Learned In School Is Not Enough

The first student I met at BIU handed me her resume, and I saw that she knew Haskell, and she’d done a machine-learning project. I thought, “cool,” and put the resume in the “call this candidate” pile. The third time I saw Haskell and machine learning, I realized that’s just what they teach at Big Ivy.

If you’re competing with students from other schools, then your coursework may be an advantage or a disadvantage. But if you’re coming to a career fair, you’re competing with kids who took the same courses as you. So I’m not impressed that you learned Haskell as a freshman—you’d have been kicked out of the program if you hadn’t.

One possibility too terrifying to contemplate is that the students I met at BIU thought their GPA mattered. If so, they’re in for a rude surprise. I know they all listed their GPAs on their resumes, but I forgot to look, and I think most employers will forget to look at GPA, as well.

Charisma Matters

It’s a shame, but it’s true: a firm handshake, eye contact, and a calm, friendly, enthusiastic manner make a big difference, even for nerds. I will spend more time with you, even though there are five kids in line behind you, and I will answer your questions better and ask you more questions. It’s not just that I’m biased towards charismatic people. Your social skills are part of what my company wants to hire. In the long run, if you work for us, you’ll be making friends with your coworkers, talking to customers, and presenting our products at conferences. We need you to be engaging.

Individual Projects, Unusual Languages, Unusual Courses

Look, if you’re graduating with a CS major, you will get a job. Relax. The market’s great. But if you actually care about software and want to work somewhere that excites you, you’ll need to put some effort into your resume and how you introduce yourself. Here’s what I want to see:

Individual Projects: 100 bonus points each

If you had an idea for a software project and you implemented it, then you should put that at the top of your resume. Above your name. And tell me about that project as soon as you shake my hand at the career fair. The project doesn’t have to be totally unique, or profitable, or complete—just make something. Then I’ll know you have cool ideas for things to build, and that you love coding, which is highly correlated with being great at coding. You’re in the “call back” pile.

If you haven’t built an individual project, start. Let your 4.0 GPA slip a little. It’s worth it to make time for this project. Don’t worry about getting college credit for the time you spend, just build it. Put the GitHub URL on your resume so I can check it out.

Extra Languages: 25 bonus points each

If the only programming languages on your resume are the required ones, then you’re showing me you do your homework. It’s not enough. Learn an extra language. It doesn’t have to be anything exotic like Erlang, just something all your peers didn’t learn in class. Put this at the top of your resume, under the individual project. Tell me about how you taught yourself C++ over summer break because you want to do 3D graphics for a living. It doesn’t matter if I’m not looking for a C++ programmer, you’re showing me you love learning about computers. But be aware that I may know this language, too, so if you claim you’re an “expert,” you better be for real.

Unusual Courses: 10 bonus points each

I know Big Ivy offers a computer graphics class, but it seems like only one student took it. All the rest just listed the same boring courses on their resumes: Operating Systems, Networking, blah blah blah. I know you took those courses; otherwise you wouldn’t be graduating. If you want me to notice you, take lots of electives. Again, your GPA doesn’t matter, so don’t worry about getting a little overloaded.

Longshots

Contributing to Open Source

I don’t recommend that undergrads go on GitHub seeking an open source project to contribute to. It’ll be different once you’ve been working for a few years, but right now you probably don’t have any itches that aren’t well-scratched by an existing project. Even if you do, I doubt you’re ready to write a patch that’s high-quality enough to be accepted. It’s much easier to start a new project on your own. For one thing, when you work on your own project, no one has to approve your patch.

Possible exceptions to this rule: Porting a package to Python 3 if no one else has started it; porting a package from a popular language to an exotic one if there’s no analogous library in the target language.

Freelancing

Your internships for other software companies are great, but I don’t recommend freelance work. It would probably be along the lines of setting up a WordPress site for your friend’s mother’s law firm. The level of sophistication required for your first real gig is going to stomp all over whatever summer job you get, so unless you really need the money, put your time into an individual project instead.

Third Normal Form and Ultimate Truth

I have an opinion: most people learned about relational databases as if RDBMSes were designed to store the ultimate truth about some data. They figured that once the schema had been properly diagrammed and normalized, then they could load all their data into it, and finally, start doing some queries.

To pick on an easy target, look at Wikipedia’s article on schema design. It summarizes the two steps a designer must take:

  1. Determine the relationships between the different data elements.
  2. Superimpose a logical structure upon the data on the basis of these relationships.

Do you see a step that’s missing? If you’ve deployed and maintained a large-scale application you’ll probably see what the Wikipedia authors omitted. In fact, it’s the first step: Figure out what one question your database must answer. Then, design your schema to answer that question as fast as possible. And now you’re done. Come to think of it, you never had to do steps 1 and 2 at all.

There’s a total disconnect between the approaches of introductory SQL courses and real-world application development, and I think this disconnect is slowing down adoption of NoSQL.

Consider Facebook Messages. After a (now rather well-publicized) evaluation process, Facebook chose HBase, a NoSQL data store, as the main database for their message system. I haven’t talked to anyone there, but I figure they chose it based on this criterion:

How fast will our database answer the question, “What are this user’s most recent 10 messages?”

They chose the database system that could answer that question the fastest, and they designed the best schema they could think of to answer that question. Anything else they need to ask HBase may be slow, or difficult, but that doesn’t matter, because “What are this user’s most recent 10 messages?” probably accounts for 99% of the load on their system.
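As a toy illustration of designing for one question, here is an in-memory sketch in Python. It is purely hypothetical: MessageStore is a made-up name and says nothing about Facebook’s or HBase’s actual schema. The layout makes “most recent 10 messages for this user” a single lookup, and leaves every other question to a full scan.

```python
import collections
import itertools


class MessageStore(object):
    """Hypothetical store shaped around one query: a user's newest messages."""

    def __init__(self):
        # One deque per user, newest message first.
        self._by_user = collections.defaultdict(collections.deque)

    def add(self, user_id, text):
        self._by_user[user_id].appendleft(text)

    def recent(self, user_id, limit=10):
        # The fast path: one lookup, then read the first few entries.
        messages = self._by_user.get(user_id, ())
        return list(itertools.islice(messages, limit))

    def search(self, word):
        # Everything else is a slow path: scan every user's messages.
        return [(user_id, text)
                for user_id, messages in self._by_user.items()
                for text in messages if word in text]


store = MessageStore()
for n in range(12):
    store.add('alice', 'message %d' % n)
store.add('bob', 'hello alice')

latest = store.recent('alice')
```

The recent() call never touches other users’ data, while search() has to visit all of it. That is the trade the paragraph above describes.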

If you learned about databases in college, following some textbook, I expect you were guided through a long process of modeling real-world data using rows and columns, to express some profound truth about the data. Then, you were introduced to SQL, with which you could query the data. At the end of the course, maybe there was a brief discussion of database performance. Probably not.

Data at the scale that the largest websites handle doesn’t work that way. Large applications design their schemas to answer one question as quickly as possible, and no other considerations are significant.

The next time you read about a NoSQL database you might wonder, “What about foreign keys, or normalization? What about transactions? Why can’t I define secondary indexes? Why are range queries prohibited?” (I’m just picking some limitations at random—each system is different.) Consider who built these new database systems, and what their experience has been. The ideas behind NoSQL databases mostly originated at places like Google, Amazon, and Yahoo. They build huge systems, and huge systems’ loads are usually dominated by a handful of queries. Companies build their database systems from the ground up to optimize the performance of these queries. NoSQL databases encourage you to figure out ahead of time, “What one question do I need to answer?” Figure that out, and choose your database software and your schema based on that. Nothing else really matters.

Philly MongoDB User Group: Python, MongoDB, and Asynchronous Web Frameworks

Philadelphia Panorama From Camden
Photo (C) Parent5446

I’ll be recapping last week’s talk on Python, MongoDB, and Asynchronous Web Frameworks this Thursday at 7pm, in Philadelphia, at the Philly MongoDB User Group’s inaugural meetup. We’ll be at the Devnuts office, at 908 North 3rd Street. We’ll have pizza, naturally.

First Philly MongoDB User Group

This Thursday: a talk on Python, MongoDB, and asynchronous web frameworks

This Thursday in NYC I’m talking about Python, MongoDB, and asynchronous web frameworks at a meetup called For the Love of Python: Wine tasting, Red velvet cupcakes, and Tech Talks. The talk is a work in progress. To be strictly accurate, I have not yet started working on the talk, because the code I’ll be talking about is itself a work in progress. But come anyway, because I’ve been thinking a lot on this subject for the last few months, and I intend to present:

  • A high-level discussion of what an async web framework is and when you need it, or don’t. I think there’s a lot of sloppiness on this subject, and I want to work with the audience on tightening up our thinking.
  • A review of pymongo, pthreads, Tornado, asyncmongo, and gevent. You won’t be disappointed.
  • For the first time ever, I will present an exclusive sneak peek at my own experimental Python driver for MongoDB and Tornado, built on top of the official pymongo driver. It’s pretty snazzy, it uses greenlets, and it’s an example of a general pattern for asynchronizing synchronous database drivers that might inspire you to write your own database driver in Python. Buckle your seatbelts, we’re going deep.